[ https://issues.apache.org/jira/browse/ARROW-13487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17404074#comment-17404074 ]
Micah Kornfield commented on ARROW-13487:
-----------------------------------------
[https://github.com/apache/arrow/blob/f406b531c1e93e97b72b2ff725db05ee18ad32f2/cpp/src/parquet/encoding.cc#L1843]
is the problematic line. It appears the documentation is too specific: what
InsertMemoValues actually does is dedupe the values as it inserts them. So I
think the two options are:
1. Make that method act more like a multimap.
2. Bypass that method and construct a dictionary array without deduping.
I think [~wesm] originally wrote this code, and I seem to recall there being
some issues here, so I am curious whether he recalls any downstream issues of
not deduping.
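To make the failure mode concrete, here is a minimal Python sketch (not the
actual C++ code path) built from the data in the report below: deduping the
dictionary-page values while leaving the indices untouched yields indices that
overrun the dictionary, and option 2 amounts to constructing the dictionary
array from the raw values directly.
{code:python}
# Dictionary page values and indices taken from the report below.
dict_page = ["2lf4µµr", "", "nwg€6d€", "", "rpc£µ£3", "", "", "", "e£a5µ矢a", ""]
indices = [2, None, None, None, None, None, None, None, None,
           8, None, None, 4, None, None, None, None]

# What a memo-table insert effectively does: keep only first occurrences.
deduped = list(dict.fromkeys(dict_page))
# deduped == ['2lf4µµr', '', 'nwg€6d€', 'rpc£µ£3', 'e£a5µ矢a']

# The raw indices now point past the end of the deduped dictionary (8 >= 5),
# which is exactly the out-of-range index the reporter observes.
assert max(i for i in indices if i is not None) >= len(deduped)

# Option 2, sketched at the pyarrow level: build the dictionary array from
# the raw indices and the non-deduped values, no remapping needed.
import pyarrow as pa

result = pa.DictionaryArray.from_arrays(
    pa.array(indices, type=pa.int32()),
    pa.array(dict_page),
)
print(result.to_pylist())
# ['nwg€6d€', None, ..., 'e£a5µ矢a', None, None, 'rpc£µ£3', None, ...]
{code}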
> [C++][Parquet] Reading dict pages is not reading all values?
> ------------------------------------------------------------
>
> Key: ARROW-13487
> URL: https://issues.apache.org/jira/browse/ARROW-13487
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Parquet
> Reporter: Jorge Leitão
> Priority: Major
> Attachments: generated_dictionary.parquet
>
>
> While round-tripping dictionary-encoded arrays through dictionary-encoded
> parquet files in arrow2, I have been unable to get pyarrow to read all values
> from the dictionary page. This contrasts with (py)spark, which can read them.
> Attached to this issue is a parquet file generated from rust's arrow2, whereby
> I read the IPC "generated_dictionary" file and write it to parquet (v1) with
> dictionary encoding, i.e. two pages: one with the values, the other with the
> indices.
> The expected result for column 0, "dict0", is:
> {code:python}
> import pyarrow
> import pyarrow.ipc
> import pyspark.sql
>
> path = "generated_dictionary"
> golden_path = f"../testing/arrow-testing/data/arrow-ipc-stream/integration/1.0.0-littleendian/{path}.arrow_file"
> column = ("dict0", 0)
>
> # expected values come from the golden IPC file
> table = pyarrow.ipc.RecordBatchFileReader(golden_path).read_all()
> expected = table.column(column[1])
> expected = expected.combine_chunks().to_pylist()
> print(expected)
> # ['nwg€6d€', None, None, None, None, None, None, None, None, 'e£a5µ矢a',
> #  None, None, 'rpc£µ£3', None, None, None, None]
>
> # read with pyspark
> spark = pyspark.sql.SparkSession.builder.config(
>     # see https://stackoverflow.com/a/62024670/931303
>     "spark.sql.parquet.enableVectorizedReader",
>     "false",
> ).getOrCreate()
> df = spark.read.parquet(f"{path}.parquet")
> rows = df.select(column[0]).collect()
> result = [row[column[0]] for row in rows]
> assert expected == result
> {code}
> However, I have been unable to read it correctly with pyarrow. The result I
> get is:
> {code:python}
> import pyarrow.parquet as pq
>
> table = pq.read_table(f"{path}.parquet")
> result = table[0]
> print(result.combine_chunks().dictionary)
> print(result.combine_chunks().indices)
> [
> "2lf4µµr",
> "",
> "nwg€6d€",
> "rpc£µ£3",
> "e£a5µ矢a"
> ]
> [
> 2,
> null,
> null,
> null,
> null,
> null,
> null,
> null,
> null,
> 8,
> null,
> null,
> 4,
> null,
> null,
> null,
> null
> ]
> {code}
> which is incorrect, as the largest index (8) exceeds the length (5) of the
> dictionary values.
> The indices are being read correctly, but not all values are. For clarity,
> the buffer in the dictionary page (PLAIN-encoded as per the spec) in the
> attached parquet file is:
> {code:python}
> # ["2lf4µµr", "", "nwg€6d€", "", "rpc£µ£3", "", "", "", "e£a5µ矢a", ""]
> [
> 9, 0, 0, 0, 50, 108, 102, 52, 194, 181, 194, 181, 114,
> 0, 0, 0, 0,
> 11, 0, 0, 0, 110, 119, 103, 226, 130, 172, 54, 100, 226, 130, 172,
> 0, 0, 0, 0,
> 10, 0, 0, 0, 114, 112, 99, 194, 163, 194, 181, 194, 163, 51,
> 0, 0, 0, 0,
> 0, 0, 0, 0,
> 0, 0, 0, 0,
> 11, 0, 0, 0, 101, 194, 163, 97, 53, 194, 181, 231, 159, 162, 97,
> 0, 0, 0, 0
> ]
> {code}
> and the reported number of values in the dict page header is 10. I would
> expect all 10 values to be read directly into the dictionary.
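> As a sanity check, a minimal standalone decoder (assuming the standard PLAIN
> layout for BYTE_ARRAY: a 4-byte little-endian length prefix followed by that
> many UTF-8 bytes) recovers all 10 values from the buffer above, duplicates
> included:
> {code:python}
> import struct
>
> # The dictionary page buffer quoted above.
> buf = bytes([
>     9, 0, 0, 0, 50, 108, 102, 52, 194, 181, 194, 181, 114,
>     0, 0, 0, 0,
>     11, 0, 0, 0, 110, 119, 103, 226, 130, 172, 54, 100, 226, 130, 172,
>     0, 0, 0, 0,
>     10, 0, 0, 0, 114, 112, 99, 194, 163, 194, 181, 194, 163, 51,
>     0, 0, 0, 0,
>     0, 0, 0, 0,
>     0, 0, 0, 0,
>     11, 0, 0, 0, 101, 194, 163, 97, 53, 194, 181, 231, 159, 162, 97,
>     0, 0, 0, 0,
> ])
>
> def decode_plain_byte_array(buf):
>     # Each value: 4-byte little-endian length, then that many bytes.
>     values, offset = [], 0
>     while offset < len(buf):
>         (length,) = struct.unpack_from("<i", buf, offset)
>         offset += 4
>         values.append(buf[offset:offset + length].decode("utf-8"))
>         offset += length
>     return values
>
> values = decode_plain_byte_array(buf)
> assert len(values) == 10  # matches the dict page header, duplicates included
> print(values)
> # ['2lf4µµr', '', 'nwg€6d€', '', 'rpc£µ£3', '', '', '', 'e£a5µ矢a', '']
> {code}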
> We cannot rule out the possibility that I am doing something wrong when
> writing. So far I have been able to round-trip these files within arrow2, and
> arrow2 can read dictionary-encoded files written by both pyarrow and pyspark,
> which suggests that the arrow2 reader is correct.