jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658720185
A slightly simplified example:
```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
foo_keys = np.array([0, 1, 3])
bar_keys = np.array(['a', 'b', 'c'], dtype=object)
N = 30
table = pa.table({
    'foo': foo_keys.repeat(10),
    'bar': np.tile(np.tile(bar_keys, 5), 2),
    'values': np.random.randn(N)
})
base_path = "test_partition_directories3"
pq.write_to_dataset(table, base_path, partition_cols=["bar", "foo"])
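# works: partitioning inferred via the "hive" shortcut
ds.dataset(base_path, partitioning="hive")
# fails: explicit discovery with dictionary-encoded partition fields
# (-1 means no limit on the dictionary size)
part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)
ds.dataset(base_path, partitioning=part)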
```
This also fails, but with a slightly different error message:
"ArrowInvalid: No dictionary provided for dictionary field bar:
dictionary<values=string, indices=int32, ordered=0>".
From playing with different keys for foo/bar, it seems that it might be
trying to use the dictionary of the first field to parse the values of the
second field (this might be a bug in my fix for HivePartitioning).
The reason I suspect this is that replacing the keys with:
```python
foo_keys = np.array(['a', 'b', 'c'], dtype=object)
bar_keys = np.array(['a', 'b', 'c'], dtype=object)
```
works, while this
```python
foo_keys = np.array(['a', 'b', 'c'], dtype=object)
bar_keys = np.array(['e', 'f', 'g'], dtype=object)
```
fails with "Dictionary supplied for field bar: dictionary<values=string,
indices=int32, ordered=0> does not contain 'e'".
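To make the suspected mix-up concrete, here is a minimal plain-Python sketch. It is not Arrow's implementation: `discover_dictionaries`, `parse_path`, `make_paths` and the `lookup_field` parameter are hypothetical names invented for illustration. It only models the hypothesis that the values of `bar` get looked up in the dictionary of `foo`, which would explain why identical key sets work while disjoint ones fail. It does not attempt to model the "No dictionary provided" error from the integer-key case.

```python
# Illustrative model only: hypothetical helpers, not Arrow's actual code.

def discover_dictionaries(paths):
    """Collect the unique values seen per hive-style key, in first-seen order."""
    dictionaries = {}
    for path in paths:
        for segment in path.split("/"):
            key, _, value = segment.partition("=")
            values = dictionaries.setdefault(key, [])
            if value not in values:
                values.append(value)
    return dictionaries


def parse_path(path, dictionaries, lookup_field=None):
    """Encode each partition value as an index into a dictionary.

    With lookup_field=None every field uses its own dictionary (the expected
    behaviour). Passing a field name forces all fields to use that field's
    dictionary, which mimics the suspected bug.
    """
    encoded = {}
    for segment in path.split("/"):
        key, _, value = segment.partition("=")
        dictionary = dictionaries[lookup_field or key]
        if value not in dictionary:
            raise ValueError(
                f"Dictionary supplied for field {key} does not contain {value!r}"
            )
        encoded[key] = dictionary.index(value)
    return encoded


def make_paths(bar_keys, foo_keys):
    """Build directory paths in the bar=.../foo=... layout used above."""
    return [f"bar={b}/foo={f}" for b in bar_keys for f in foo_keys]


same_paths = make_paths(["a", "b", "c"], ["a", "b", "c"])
diff_paths = make_paths(["e", "f", "g"], ["a", "b", "c"])

# Identical key sets: the wrong dictionary happens to contain every value.
ok = parse_path(same_paths[0], discover_dictionaries(same_paths), lookup_field="foo")

# Disjoint key sets: the lookup of a `bar` value in the `foo` dictionary fails.
try:
    parse_path(diff_paths[0], discover_dictionaries(diff_paths), lookup_field="foo")
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

In this model, the disjoint-key case fails on the first path because `'e'` is absent from the `foo` dictionary, while calling `parse_path` without `lookup_field` succeeds for both key sets. That matches the pattern observed above, but it is only consistent with the hypothesis and does not confirm where the real bug lives.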