jorisvandenbossche commented on pull request #7545: URL: https://github.com/apache/arrow/pull/7545#issuecomment-658720185

A slightly simplified example:

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

foo_keys = np.array([0, 1, 3])
bar_keys = np.array(['a', 'b', 'c'], dtype=object)
N = 30

table = pa.table({
    'foo': foo_keys.repeat(10),
    'bar': np.tile(np.tile(bar_keys, 5), 2),
    'values': np.random.randn(N)
})

base_path = "test_partition_directories3"
pq.write_to_dataset(table, base_path, partition_cols=["bar", "foo"])

# works
ds.dataset(base_path, partitioning="hive")

# fails
part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)
ds.dataset(base_path, partitioning=part)
```

This also fails, with "ArrowInvalid: No dictionary provided for dictionary field bar: dictionary<values=string, indices=int32, ordered=0>" (so a slightly different error message).

From playing with different keys for foo/bar, it seems that it might be trying to use the dictionary of the first field to parse the values of the second field (this might be a bug in my fix for HivePartitioning). Replacing the keys with:

```python
foo_keys = np.array(['a', 'b', 'c'], dtype=object)
bar_keys = np.array(['a', 'b', 'c'], dtype=object)
```

works, while this

```python
foo_keys = np.array(['a', 'b', 'c'], dtype=object)
bar_keys = np.array(['e', 'f', 'g'], dtype=object)
```

fails with "Dictionary supplied for field bar: dictionary<values=string, indices=int32, ordered=0> does not contain 'e'".
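
To make the suspected mix-up concrete, here is a toy model in plain Python. It is only a sketch of the hypothesis and not Arrow's implementation: `encode_segment` and its `lookup_field` parameter are hypothetical names, and each per-field dictionary is modelled as a plain list. It shows why looking up `bar` values in the dictionary of `foo` would go unnoticed when both fields share the same keys, and would fail once the key sets differ.

```python
# Toy model of the suspected bug; NOT Arrow's actual code.
# `encode_segment` and `lookup_field` are hypothetical names.

def encode_segment(segment, dictionaries, lookup_field=None):
    """Encode a hive-style 'field=value' segment as (field, dictionary index).

    If `lookup_field` is given, the value is (incorrectly) looked up in that
    field's dictionary instead of its own, mimicking the suspected mix-up.
    """
    field, value = segment.split("=", 1)
    dictionary = dictionaries[lookup_field or field]
    if value not in dictionary:
        raise ValueError(
            f"Dictionary supplied for field {field} does not contain {value!r}"
        )
    return field, dictionary.index(value)


same_keys = {"foo": ["a", "b", "c"], "bar": ["a", "b", "c"]}
different_keys = {"foo": ["a", "b", "c"], "bar": ["e", "f", "g"]}

# Identical key sets: using the wrong dictionary still finds the value.
print(encode_segment("bar=a", same_keys, lookup_field="foo"))

# Different key sets: the same mix-up now raises.
try:
    encode_segment("bar=e", different_keys, lookup_field="foo")
except ValueError as exc:
    print(exc)
```

This toy model only covers the two string-keyed experiments; it says nothing about the "No dictionary provided" error from the integer-keyed example.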