[
https://issues.apache.org/jira/browse/ARROW-9345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joris Van den Bossche updated ARROW-9345:
-----------------------------------------
Fix Version/s: 1.0.0
> [C++][Dataset] Expression with dictionary type should work with operand of
> value type
> --------------------------------------------------------------------------------------
>
> Key: ARROW-9345
> URL: https://issues.apache.org/jira/browse/ARROW-9345
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Joris Van den Bossche
> Assignee: Ben Kietzman
> Priority: Major
> Labels: dataset, pull-request-available
> Fix For: 1.0.0
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Related to ARROW-8647, see comment at
> https://github.com/apache/arrow/pull/7536#issuecomment-653124260
> When using dictionary type for the partition fields, this now creates
> partition expressions that also use a dictionary type. Which means that doing
> something like {{dataset.to_table(filter=ds.field("part") == "A")}} to filter
> on the partition field with a plain string expression doesn't work, limiting
> the usability of this option (and even with the new Python scalar stuff, it
> would not be easy to construct the correct expression):
> {code}
> In [9]: part = ds.HivePartitioning.discover(max_partition_dictionary_size=2)
> In [10]: dataset = ds.dataset("test_partitioned_filter/", format="parquet",
> partitioning=part)
> In [11]: fragment = list(dataset.get_fragments())[0]
> In [12]: fragment.partition_expression
> Out[12]:
> <pyarrow.dataset.Expression (part == [
> "A",
> "B"
> ][0]:dictionary<values=string, indices=int32, ordered=0>)>
> In [13]: dataset.to_table(filter=ds.field("part") == "A")
> ...
> ArrowNotImplementedError: cast from string
> {code}
> It might be an option to keep the {{partition_expression}} use the dictionary
> _value type_ instead of dictionary type? Or alternatively, as
> [~fsaintjacques] proposed, ensure that any comparison involving the dict type
> should also work with the "effective" logical type (the value type of the
> dict).
--
This message was sent by Atlassian Jira
(v8.3.4#803005)