[GitHub] [arrow] jorisvandenbossche commented on pull request #7536: ARROW-8647: [C++][Python][Dataset] Allow partitioning fields to be inferred with dictionary type

GitBox Wed, 01 Jul 2020 08:40:03 -0700


jorisvandenbossche commented on pull request #7536:
URL: https://github.com/apache/arrow/pull/7536#issuecomment-652493993



   @bkietz thanks for the update ensuring all uniques as dictionary values!
   
   Testing this out, I ran into an issue with HivePartitioning -> ARROW-9288 / 
#7608
   
   Further, a usability issue: this now creates partition expressions that use 
a dictionary type. This means that filtering on the partition field with a 
plain string expression, e.g. 
`dataset.to_table(filter=ds.field("part") == "A")`, doesn't work, which limits 
the usability of this option (and even with the new Python scalar stuff, it 
would not be easy to construct the correct expression):
   
   ```
   In [9]: part = ds.HivePartitioning.discover(max_partition_dictionary_size=2)
   
   In [10]: dataset = ds.dataset("test_partitioned_filter/", format="parquet", partitioning=part)
   
   In [11]: fragment = list(dataset.get_fragments())[0]   
   
   In [12]: fragment.partition_expression  
   Out[12]: 
   <pyarrow.dataset.Expression (part == [
     "A",
     "B"
   ][0]:dictionary<values=string, indices=int32, ordered=0>)>
   
   In [13]: dataset.to_table(filter=ds.field("part") == "A") 
   ...
   ArrowNotImplementedError: cast from string
   ```
   
   It might also be an option to have the `partition_expression` use the 
dictionary *value type* instead of the dictionary type?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
