[ https://issues.apache.org/jira/browse/ARROW-13487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jorge Leitão updated ARROW-13487:
---------------------------------
    Description: 
While round-tripping dictionary-encoded arrays through dictionary-encoded parquet 
files in arrow2, I have been unable to get pyarrow to read all values from the 
dictionary page. This contrasts with (py)spark, which can read them.

Attached to this issue is a parquet file generated with Rust's arrow2, whereby I 
read the IPC "generated_dictionary" file and write it to parquet (v1) with 
dictionary encoding, i.e. two pages per column chunk: one with the values (the 
dictionary page) and one with the indices.
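
For reference, a rough pyarrow equivalent of that write step would look as follows (a sketch only; the attached file was produced by arrow2, not by this code, and the paths mirror the snippet below):
{code:python}
import pyarrow.ipc
import pyarrow.parquet as pq

path = "generated_dictionary"
golden_path = f"../testing/arrow-testing/data/arrow-ipc-stream/integration/1.0.0-littleendian/{path}.arrow_file"

# read the golden IPC file and write it as parquet v1 with dictionary encoding
# (use_dictionary=True is pyarrow's default; spelled out here for clarity)
table = pyarrow.ipc.RecordBatchFileReader(golden_path).read_all()
pq.write_table(table, f"{path}.parquet", version="1.0", use_dictionary=True)
{code}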

The expected result for column 0, "dict0", is:
{code:python}
import pyarrow
import pyarrow.ipc
import pyspark.sql

path = "generated_dictionary"
golden_path = f"../testing/arrow-testing/data/arrow-ipc-stream/integration/1.0.0-littleendian/{path}.arrow_file"
column = ("dict0", 0)

table = pyarrow.ipc.RecordBatchFileReader(golden_path).read_all()
expected = table.column(column[1])  # select the column by index
expected = expected.combine_chunks().to_pylist()
print(expected)
# ['nwg€6d€', None, None, None, None, None, None, None, None, 'e£a5µ矢a', None,
#  None, 'rpc£µ£3', None, None, None, None]


# read with pyspark
spark = pyspark.sql.SparkSession.builder.config(
    # see https://stackoverflow.com/a/62024670/931303
    "spark.sql.parquet.enableVectorizedReader",
    "false",
).getOrCreate()

df = spark.read.parquet(f"{golden_path}.parquet")

r = df.select(column[0]).collect()

result = [row[column[0]] for row in r]
assert expected == result
{code}
However, I have been unable to read it correctly with pyarrow. The result I get is:
{code:python}
import pyarrow.parquet as pq

table = pq.read_table(f"{path}.parquet")
result = table[0]  # "dict0"
print(result.combine_chunks().dictionary)
print(result.combine_chunks().indices)
# output:
[
  "2lf4µµr",
  "",
  "nwg€6d€",
  "rpc£µ£3",
  "e£a5µ矢a"
]
[
  2,
  null,
  null,
  null,
  null,
  null,
  null,
  null,
  null,
  8,
  null,
  null,
  4,
  null,
  null,
  null,
  null
]
{code}
which is incorrect, as the largest index (8) exceeds the length (5) of the 
dictionary values.
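
Continuing from the pyarrow snippet above, a quick consistency check makes the problem explicit (this assertion fails on the data above):
{code:python}
dictionary = result.combine_chunks().dictionary.to_pylist()
indices = result.combine_chunks().indices.to_pylist()

valid_indices = [i for i in indices if i is not None]
# every index must point into the dictionary; here max(valid_indices) == 8
# while len(dictionary) == 5, so this fails
assert max(valid_indices) < len(dictionary)
{code}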

The indices are being read correctly, but not all dictionary values are. For 
clarity, the buffer of the dictionary page (PLAIN-encoded, as per the spec) in 
the attached parquet file is:
{code:python}
# ["2lf4µµr", "", "nwg€6d€", "", "rpc£µ£3", "", "", "", "e£a5µ矢a", ""]

[
9, 0, 0, 0, 50, 108, 102, 52, 194, 181, 194, 181, 114,
0, 0, 0, 0, 
11, 0, 0, 0, 110, 119, 103, 226, 130, 172, 54, 100, 226, 130, 172, 
0, 0, 0, 0,
10, 0, 0, 0, 114, 112, 99, 194, 163, 194, 181, 194, 163, 51, 
0, 0, 0, 0, 
0, 0, 0, 0, 
0, 0, 0, 0, 
11, 0, 0, 0, 101, 194, 163, 97, 53, 194, 181, 231, 159, 162, 97, 
0, 0, 0, 0
]
{code}
and the number of values reported in the dictionary page header is 10. I would 
expect all 10 values to be read into the dictionary.
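
To make explicit what I expect a reader to recover from that buffer, here is a small decoding sketch (not pyarrow's actual implementation): in PLAIN encoding, each BYTE_ARRAY value is a 4-byte little-endian length followed by that many bytes.
{code:python}
import struct

def decode_plain_byte_array(buf: bytes, num_values: int) -> list:
    """Decode num_values PLAIN-encoded BYTE_ARRAY values from buf."""
    values, offset = [], 0
    for _ in range(num_values):
        (length,) = struct.unpack_from("<i", buf, offset)  # 4-byte LE length
        offset += 4
        values.append(buf[offset:offset + length].decode("utf-8"))
        offset += length
    return values

buf = bytes([
    9, 0, 0, 0, 50, 108, 102, 52, 194, 181, 194, 181, 114,
    0, 0, 0, 0,
    11, 0, 0, 0, 110, 119, 103, 226, 130, 172, 54, 100, 226, 130, 172,
    0, 0, 0, 0,
    10, 0, 0, 0, 114, 112, 99, 194, 163, 194, 181, 194, 163, 51,
    0, 0, 0, 0,
    0, 0, 0, 0,
    0, 0, 0, 0,
    11, 0, 0, 0, 101, 194, 163, 97, 53, 194, 181, 231, 159, 162, 97,
    0, 0, 0, 0,
])
# with num_values = 10 from the page header this recovers all 10 values:
print(decode_plain_byte_array(buf, 10))
# ['2lf4µµr', '', 'nwg€6d€', '', 'rpc£µ£3', '', '', '', 'e£a5µ矢a', '']
{code}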

We cannot rule out the possibility that I am doing something wrong when writing. 
So far I have been able to round-trip these files within arrow2, and arrow2 can 
read dictionary-encoded files written by both pyarrow and pyspark, which suggests 
that the arrow2 reader is correct.
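
If it helps to narrow down whether the file itself is malformed, the column-chunk metadata of the attached file can be inspected with pyarrow's parquet API (a diagnostic sketch; it assumes the attached generated_dictionary.parquet is in the working directory):
{code:python}
import pyarrow.parquet as pq

pf = pq.ParquetFile("generated_dictionary.parquet")
col = pf.metadata.row_group(0).column(0)
# shows num_values and the encodings used for "dict0"
# (e.g. PLAIN_DICTIONARY / RLE_DICTIONARY for the indices)
print(col)
print(pf.schema_arrow)
{code}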


> [C++][Parquet] Reading dict pages is not reading all values?
> ------------------------------------------------------------
>
>                 Key: ARROW-13487
>                 URL: https://issues.apache.org/jira/browse/ARROW-13487
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Parquet
>            Reporter: Jorge Leitão
>            Priority: Major
>         Attachments: generated_dictionary.parquet
>
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
