[
https://issues.apache.org/jira/browse/ARROW-11400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324992#comment-17324992
]
Krisztian Szucs commented on ARROW-11400:
-----------------------------------------
The issue title and commit title often differ, but we include the Jira title in
the changelog. We could update it to use the commit titles instead, to better
describe the nature of the fix.
> [Python] Pickled ParquetFileFragment has invalid partition_expresion with
> dictionary type in pyarrow 2.0
> --------------------------------------------------------------------------------------------------------
>
> Key: ARROW-11400
> URL: https://issues.apache.org/jira/browse/ARROW-11400
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Joris Van den Bossche
> Assignee: Joris Van den Bossche
> Priority: Minor
> Labels: dataset, pull-request-available
> Fix For: 4.0.0
>
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> From https://github.com/dask/dask/pull/7066#issuecomment-767156623
> Simplified reproducer:
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pyarrow.dataset as ds
> table = pa.table({'part': ['A', 'B']*5, 'col': range(10)})
> pq.write_to_dataset(table, "test_partitioned_parquet",
>                     partition_cols=["part"])
> # with partitioning_kwargs = {} there is no error
> partitioning_kwargs = {"max_partition_dictionary_size": -1}
> dataset = ds.dataset(
>     "test_partitioned_parquet/", format="parquet",
>     partitioning=ds.HivePartitioning.discover(**partitioning_kwargs)
> )
> frag = list(dataset.get_fragments())[0]
> {code}
> Querying this fragment works fine, but after serialization/deserialization
> with pickle, it gives errors (and with the original data example I actually
> got a segfault as well):
> {code}
> In [16]: import pickle
> In [17]: frag2 = pickle.loads(pickle.dumps(frag))
> In [19]: frag2.partition_expression
> ...
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 16:
> invalid continuation byte
> In [20]: frag2.to_table(schema=schema, columns=columns)
> Out[20]:
> pyarrow.Table
> col: int64
> part: dictionary<values=string, indices=int32, ordered=0>
> In [21]: frag2.to_table(schema=schema, columns=columns).to_pandas()
> ...
> ~/miniconda3/envs/arrow-20/lib/python3.8/site-packages/pyarrow/table.pxi in
> pyarrow.lib.table_to_blocks()
> ArrowException: Unknown error: Wrapping ɻ� failed
> {code}
> It seems the issue was specifically with a partition expression with
> dictionary type.
> Also, when using an integer column as the partition column, you get wrong
> values (silently, in this case):
> {code:python}
> In [42]: frag.partition_expression
> Out[42]:
> <pyarrow.dataset.Expression (part == [
> 1,
> 2
> ][0]:dictionary<values=int32, indices=int32, ordered=0>)>
> In [43]: frag2.partition_expression
> Out[43]:
> <pyarrow.dataset.Expression (part == [
> 170145232,
> 32754
> ][0]:dictionary<values=int32, indices=int32, ordered=0>)>
> {code}
> Now, it seems this is fixed on master. But since I don't remember whether it
> was fixed intentionally ([~bkietz]?), it would be good to add some tests for it.
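A regression test along the lines suggested above might look like the following sketch. It assumes pyarrow >= 4.0 (the Fix Version) and uses {{infer_dictionary=True}}, the non-deprecated spelling of {{max_partition_dictionary_size=-1}}; the expression comparison goes through {{str()}} rather than relying on any particular equality semantics for {{Expression}}.

{code:python}
import pickle
import tempfile

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq


def test_pickled_fragment_partition_expression():
    table = pa.table({"part": ["A", "B"] * 5, "col": range(10)})
    with tempfile.TemporaryDirectory() as tmpdir:
        pq.write_to_dataset(table, tmpdir, partition_cols=["part"])
        dataset = ds.dataset(
            tmpdir,
            format="parquet",
            partitioning=ds.HivePartitioning.discover(infer_dictionary=True),
        )
        for frag in dataset.get_fragments():
            frag2 = pickle.loads(pickle.dumps(frag))
            # The round-tripped expression must describe the same partition value
            assert str(frag2.partition_expression) == str(frag.partition_expression)
            # Reading through the pickled fragment must yield the same rows
            assert frag2.to_table(schema=dataset.schema).equals(
                frag.to_table(schema=dataset.schema)
            )
{code}

On an unfixed pyarrow 2.0 the first assertion would raise (or hit the {{UnicodeDecodeError}} shown above) for dictionary-typed partition expressions, while on master both assertions should pass.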
--
This message was sent by Atlassian Jira
(v8.3.4#803005)