jorisvandenbossche commented on pull request #7536:
URL: https://github.com/apache/arrow/pull/7536#issuecomment-652493993
@bkietz thanks for the update ensuring all uniques as dictionary values!
Testing this out, I ran into an issue with HivePartitioning; see ARROW-9288 /
#7608.
Further, there is a usability issue: this now creates partition expressions
that use a dictionary type. As a result, filtering on the partition field with
a plain string expression, e.g.
`dataset.to_table(filter=ds.field("part") == "A")`, doesn't work, which limits
the usability of this option (and even with the new Python scalar support, it
would not be easy to construct the correct expression):
```
In [9]: part = ds.HivePartitioning.discover(max_partition_dictionary_size=2)
In [10]: dataset = ds.dataset("test_partitioned_filter/", format="parquet", partitioning=part)
In [11]: fragment = list(dataset.get_fragments())[0]
In [12]: fragment.partition_expression
Out[12]:
<pyarrow.dataset.Expression (part == [
"A",
"B"
][0]:dictionary<values=string, indices=int32, ordered=0>)>
In [13]: dataset.to_table(filter=ds.field("part") == "A")
...
ArrowNotImplementedError: cast from string
```
It might also be an option to have the `partition_expression` use the
dictionary *value type* instead of the dictionary type?