jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658714201
When enabling dictionary encoding for string partition fields, a number of
tests actually fail.
E.g. this one (based on `test_read_partitioned_directory`):
```python
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds

foo_keys = [0, 1]
bar_keys = ['a', 'b', 'c']
partition_spec = [
    ['foo', foo_keys],
    ['bar', bar_keys],
]
N = 30
df = pd.DataFrame({
    'index': np.arange(N),
    'foo': np.array(foo_keys, dtype='i4').repeat(15),
    'bar': np.tile(np.tile(np.array(bar_keys, dtype=object), 5), 2),
    'values': np.random.randn(N)
}, columns=['index', 'foo', 'bar', 'values'])

from pyarrow.tests.test_parquet import _generate_partition_directories
fs = pa.filesystem.LocalFileSystem()
_generate_partition_directories(fs, "test_partition_directories",
                                partition_spec, df)

# works
ds.dataset("test_partition_directories/", partitioning="hive")

# fails
part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)
ds.dataset("test_partition_directories/", partitioning=part)
```
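For reference, `_generate_partition_directories` with this spec lays out a hive-style tree; a minimal sketch of the expected paths, built with only the standard library (directory names illustrative):

```python
import os

# Hive-style partition directories implied by
# partition_spec = [['foo', [0, 1]], ['bar', ['a', 'b', 'c']]]
foo_keys = [0, 1]
bar_keys = ['a', 'b', 'c']
paths = [os.path.join("test_partition_directories", f"foo={foo}", f"bar={bar}")
         for foo in foo_keys
         for bar in bar_keys]
print(paths)
```

Note that every `bar` value, including `'c'`, appears under each `foo` directory, so one would expect the discovered dictionary to contain it.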
fails with
```
ArrowInvalid: Dictionary supplied for field bar: dictionary<values=string,
indices=int32, ordered=0> does not contain 'c'
In ../src/arrow/dataset/partition.cc, line 55, code:
(_error_or_value13).status()
In ../src/arrow/dataset/discovery.cc, line 243, code:
(_error_or_value16).status()
```
Another reproducer (based on `test_write_to_dataset_with_partitions`),
which gives a similar error:
```python
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

output_df = pd.DataFrame({'group1': list('aaabbbbccc'),
                          'group2': list('eefeffgeee'),
                          'num': list(range(10)),
                          'nan': [np.nan] * 10,
                          'date': np.arange('2017-01-01', '2017-01-11',
                                            dtype='datetime64[D]')})
cols = output_df.columns.tolist()
partition_by = ['group1', 'group2']
output_table = pa.Table.from_pandas(output_df, safe=False,
                                    preserve_index=False)
filesystem = pa.filesystem.LocalFileSystem()
base_path = "test_partition_directories2/"
pq.write_to_dataset(output_table, base_path, partition_by,
                    filesystem=filesystem)

# works
ds.dataset("test_partition_directories2/", partitioning="hive")

# fails
part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)
ds.dataset("test_partition_directories2/", partitioning=part)
```
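For context, `write_to_dataset` writes one directory per unique `(group1, group2)` combination; a pandas-only sketch of the resulting partition combinations (directory names illustrative):

```python
import pandas as pd

output_df = pd.DataFrame({'group1': list('aaabbbbccc'),
                          'group2': list('eefeffgeee')})
# One hive-style directory per unique combination of the partition columns
combos = (output_df[['group1', 'group2']]
          .drop_duplicates()
          .sort_values(['group1', 'group2']))
dirs = [f"group1={g1}/group2={g2}"
        for g1, g2 in combos.itertuples(index=False)]
print(dirs)
```

Note the partition values are not fully crossed here: for example, `group2='g'` only occurs under `group1='b'`.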
I couldn't yet figure out why it is failing in those cases, though.
I should have tested the dictionary encoding feature more thoroughly
earlier, sorry about that.
But given the current state (unless someone can fix it today, which I don't
have much time for), the choice seems quite simple: merge as is, without
dictionary encoding, or delay until after 1.0.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]