[GitHub] [arrow] jorisvandenbossche commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

GitBox Wed, 15 Jul 2020 04:44:49 -0700


jorisvandenbossche commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658720185



   A slightly simplified example:
   
   ```python
   import numpy as np
   import pyarrow as pa
   import pyarrow.parquet as pq
   import pyarrow.dataset as ds 
   
   foo_keys = np.array([0, 1, 3])
   bar_keys = np.array(['a', 'b', 'c'], dtype=object)
   N = 30
   
   table = pa.table({
       'foo': foo_keys.repeat(10),
       'bar': np.tile(np.tile(bar_keys, 5), 2),
       'values': np.random.randn(N)
   })
   
   base_path = "test_partition_directories3"
   pq.write_to_dataset(table, base_path, partition_cols=["bar", "foo"])
   
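   # works: partition fields are inferred with plain (non-dictionary) types
   ds.dataset(base_path, partitioning="hive")
   # fails: partition fields are inferred as dictionary types
   part = ds.HivePartitioning.discover(max_partition_dictionary_size=-1)
   ds.dataset(base_path, partitioning=part)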
   ```
   
   This also fails, with "ArrowInvalid: No dictionary provided for dictionary field bar: dictionary<values=string, indices=int32, ordered=0>" (so a slightly different error message).
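
For orientation, here is a small sketch of the hive-style directories that `pq.write_to_dataset` should produce for the table above. It rebuilds the `foo` and `bar` columns in plain Python (no numpy or pyarrow needed) and assumes the usual `field=value` layout, nested in the order given by `partition_cols=["bar", "foo"]`. The data file names inside each directory are left out.

```python
# Rebuild the partition columns of the example table without numpy.
foo_keys = [0, 1, 3]
bar_keys = ["a", "b", "c"]

# Equivalent to foo_keys.repeat(10): each key repeated 10 times in a row.
foo = [key for key in foo_keys for _ in range(10)]
# Equivalent to np.tile(np.tile(bar_keys, 5), 2): the keys cycled 10 times.
bar = [bar_keys[i % len(bar_keys)] for i in range(30)]

# Assumed hive layout: one directory level per partition column,
# in the order of partition_cols=["bar", "foo"].
partition_dirs = sorted({f"bar={b}/foo={f}" for b, f in zip(bar, foo)})

for directory in partition_dirs:
    print(directory)
```

Every combination of `bar` and `foo` occurs in the table, so this lists nine directories, from `bar=a/foo=0` to `bar=c/foo=3`. These path segments are what the partitioning has to parse back into field values.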
   
   Experimenting with different keys for `foo` and `bar` suggests that it might be trying to use the dictionary of the first field to parse the values of the second field (this might be a bug in my fix for HivePartitioning).
   
   For example, replacing the keys with:
   
   ```python
   foo_keys = np.array(['a', 'b', 'c'], dtype=object)
   bar_keys = np.array(['a', 'b', 'c'], dtype=object)
   ```
   
   works, while this
   
   ```python
   foo_keys = np.array(['a', 'b', 'c'], dtype=object) 
   bar_keys = np.array(['e', 'f', 'g'], dtype=object) 
   ```
   
   fails with "Dictionary supplied for field bar: dictionary<values=string, indices=int32, ordered=0> does not contain 'e'".
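
The suspicion above can be illustrated with a toy model in plain Python. This is not Arrow's implementation: `check_path`, `outcome`, and the `lookup_field` mapping are hypothetical names invented for the sketch, and which field ends up with which dictionary is an assumption chosen only so that the message resembles the one reported. It shows why a mix-up between the fields' dictionaries would go unnoticed when both fields share the same keys and would fail when they do not.

```python
def check_path(path, dictionaries, lookup_field):
    """Validate each `field=value` segment of a hive-style path.

    `dictionaries` maps a field name to its set of allowed values.
    `lookup_field` maps a field name to the field whose dictionary is
    consulted; an identity mapping models the intended behaviour.
    """
    parsed = {}
    for segment in path.split("/"):
        name, _, value = segment.partition("=")
        allowed = dictionaries[lookup_field[name]]
        if value not in allowed:
            raise ValueError(
                f"Dictionary supplied for field {name} does not contain {value!r}"
            )
        parsed[name] = value
    return parsed


def outcome(path, dictionaries, lookup_field):
    """Return "ok" or the error message, for easy comparison."""
    try:
        check_path(path, dictionaries, lookup_field)
        return "ok"
    except ValueError as exc:
        return str(exc)


correct = {"bar": "bar", "foo": "foo"}
# Assumption: each field is checked against the other field's dictionary.
mixed_up = {"bar": "foo", "foo": "bar"}

same_keys = {"foo": {"a", "b", "c"}, "bar": {"a", "b", "c"}}
different_keys = {"foo": {"a", "b", "c"}, "bar": {"e", "f", "g"}}

print(outcome("bar=a/foo=a", same_keys, mixed_up))       # mix-up is masked
print(outcome("bar=e/foo=a", different_keys, mixed_up))  # mix-up surfaces
print(outcome("bar=e/foo=a", different_keys, correct))   # intended behaviour
```

In this model, identical key sets hide the mix-up, while disjoint key sets produce a "does not contain 'e'" error for field `bar`, which is consistent with the two observations above.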


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
