[ https://issues.apache.org/jira/browse/ARROW-13487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17404074#comment-17404074 ]
Micah Kornfield commented on ARROW-13487:
-----------------------------------------
[https://github.com/apache/arrow/blob/f406b531c1e93e97b72b2ff725db05ee18ad32f2/cpp/src/parquet/encoding.cc#L1843]
is the problematic line. It appears the documentation is too specific: what
InsertMemoValues actually does is dedupe the values as it inserts them. So I
think the two options are:
1. Make that method act more like a multimap.
2. Bypass that method and construct a dictionary array without deduping.
I think [~wesm] originally wrote this code, and I seem to recall there being
some issues here, so I am curious whether he recalls any downstream issues of
not deduping.
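To make the failure mode concrete, here is a minimal Python sketch (not the
actual C++ code path) built from the data in the report below: deduping the
dictionary-page values while leaving the indices untouched yields indices that
overrun the dictionary, and option 2 amounts to constructing the dictionary
array from the raw values directly.
{code:python}
# Dictionary page values and indices taken from the report below.
dict_page = ["2lf4µµr", "", "nwg€6d€", "", "rpc£µ£3", "", "", "", "e£a5µ矢a", ""]
indices = [2, None, None, None, None, None, None, None, None,
           8, None, None, 4, None, None, None, None]

# What a memo-table insert effectively does: keep only first occurrences.
deduped = list(dict.fromkeys(dict_page))
# deduped == ['2lf4µµr', '', 'nwg€6d€', 'rpc£µ£3', 'e£a5µ矢a']

# The raw indices now point past the end of the deduped dictionary (8 >= 5),
# which is exactly the out-of-range index the reporter observes.
assert max(i for i in indices if i is not None) >= len(deduped)

# Option 2, sketched at the pyarrow level: build the dictionary array from
# the raw indices and the non-deduped values, no remapping needed.
import pyarrow as pa

result = pa.DictionaryArray.from_arrays(
    pa.array(indices, type=pa.int32()),
    pa.array(dict_page),
)
print(result.to_pylist())
# ['nwg€6d€', None, ..., 'e£a5µ矢a', None, None, 'rpc£µ£3', None, ...]
{code}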
> [C++][Parquet] Reading dict pages is not reading all values?
> ------------------------------------------------------------
>
> Key: ARROW-13487
> URL: https://issues.apache.org/jira/browse/ARROW-13487
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Parquet
> Reporter: Jorge Leitão
> Priority: Major
> Attachments: generated_dictionary.parquet
>
>
> While round-tripping dictionary-encoded arrays through dictionary-encoded
> parquet files in arrow2, I have been unable to get pyarrow to read all values
> from the dictionary page. This contrasts with (py)spark, which can read them.
> Attached to this issue is a parquet file generated from rust's arrow2, whereby
> I read the IPC "generated_dictionary" file and write it to parquet (v1) with
> dictionary encoding, i.e. two pages: one with the values, the other with the
> indices.
> The expected result for column 0, "dict0", is:
> {code:python}
> import pyarrow
> import pyarrow.ipc
> import pyspark.sql
>
> path = "generated_dictionary"
> golden_path = f"../testing/arrow-testing/data/arrow-ipc-stream/integration/1.0.0-littleendian/{path}.arrow_file"
> column = ("dict0", 0)
>
> # expected values come from the golden IPC file
> table = pyarrow.ipc.RecordBatchFileReader(golden_path).read_all()
> expected = table.column(column[1])
> expected = expected.combine_chunks().to_pylist()
> print(expected)
> # ['nwg€6d€', None, None, None, None, None, None, None, None, 'e£a5µ矢a',
> #  None, None, 'rpc£µ£3', None, None, None, None]
>
> # read with pyspark
> spark = pyspark.sql.SparkSession.builder.config(
>     # see https://stackoverflow.com/a/62024670/931303
>     "spark.sql.parquet.enableVectorizedReader",
>     "false",
> ).getOrCreate()
> df = spark.read.parquet(f"{path}.parquet")
> rows = df.select(column[0]).collect()
> result = [row[column[0]] for row in rows]
> assert expected == result
> {code}
> However, I have been unable to read it correctly with pyarrow. The result I
> get is:
> {code:python}
> import pyarrow.parquet as pq
>
> table = pq.read_table(f"{path}.parquet")
> result = table[0]
> print(result.combine_chunks().dictionary)
> print(result.combine_chunks().indices)
> [
> "2lf4µµr",
> "",
> "nwg€6d€",
> "rpc£µ£3",
> "e£a5µ矢a"
> ]
> [
> 2,
> null,
> null,
> null,
> null,
> null,
> null,
> null,
> null,
> 8,
> null,
> null,
> 4,
> null,
> null,
> null,
> null
> ]
> {code}
> which is incorrect, as the largest index (8) exceeds the length (5) of the
> dictionary values.
> The indices are being read correctly, but not all values are. For clarity,
> the buffer in the dictionary page (PLAIN-encoded as per the spec) in the
> attached parquet file is:
> {code:python}
> # ["2lf4µµr", "", "nwg€6d€", "", "rpc£µ£3", "", "", "", "e£a5µ矢a", ""]
> [
> 9, 0, 0, 0, 50, 108, 102, 52, 194, 181, 194, 181, 114,
> 0, 0, 0, 0,
> 11, 0, 0, 0, 110, 119, 103, 226, 130, 172, 54, 100, 226, 130, 172,
> 0, 0, 0, 0,
> 10, 0, 0, 0, 114, 112, 99, 194, 163, 194, 181, 194, 163, 51,
> 0, 0, 0, 0,
> 0, 0, 0, 0,
> 0, 0, 0, 0,
> 11, 0, 0, 0, 101, 194, 163, 97, 53, 194, 181, 231, 159, 162, 97,
> 0, 0, 0, 0
> ]
> {code}
> and the reported number of values in the dict page header is 10. I would
> expect all 10 values to be read directly into the dictionary.
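> As a sanity check, a minimal standalone decoder (assuming the standard PLAIN
> layout for BYTE_ARRAY: a 4-byte little-endian length prefix followed by that
> many UTF-8 bytes) recovers all 10 values from the buffer above, duplicates
> included:
> {code:python}
> import struct
>
> # The dictionary page buffer quoted above.
> buf = bytes([
>     9, 0, 0, 0, 50, 108, 102, 52, 194, 181, 194, 181, 114,
>     0, 0, 0, 0,
>     11, 0, 0, 0, 110, 119, 103, 226, 130, 172, 54, 100, 226, 130, 172,
>     0, 0, 0, 0,
>     10, 0, 0, 0, 114, 112, 99, 194, 163, 194, 181, 194, 163, 51,
>     0, 0, 0, 0,
>     0, 0, 0, 0,
>     0, 0, 0, 0,
>     11, 0, 0, 0, 101, 194, 163, 97, 53, 194, 181, 231, 159, 162, 97,
>     0, 0, 0, 0,
> ])
>
> def decode_plain_byte_array(buf):
>     # Each value: 4-byte little-endian length, then that many bytes.
>     values, offset = [], 0
>     while offset < len(buf):
>         (length,) = struct.unpack_from("<i", buf, offset)
>         offset += 4
>         values.append(buf[offset:offset + length].decode("utf-8"))
>         offset += length
>     return values
>
> values = decode_plain_byte_array(buf)
> assert len(values) == 10  # matches the dict page header, duplicates included
> print(values)
> # ['2lf4µµr', '', 'nwg€6d€', '', 'rpc£µ£3', '', '', '', 'e£a5µ矢a', '']
> {code}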
> We cannot rule out the possibility that I am doing something wrong when
> writing. So far I have been able to round-trip these files within arrow2, and
> arrow2 can read dictionary-encoded files written by both pyarrow and pyspark,
> which suggests that the arrow2 reader is correct.