Joris Van den Bossche created ARROW-9345:
--------------------------------------------
Summary: [C++][Dataset] Expression with dictionary type should
work with operand of value type
Key: ARROW-9345
URL: https://issues.apache.org/jira/browse/ARROW-9345
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Joris Van den Bossche
Related to ARROW-8647, see comment at
https://github.com/apache/arrow/pull/7536#issuecomment-653124260
When using dictionary type for the partition fields, this now creates partition
expressions that also use a dictionary type. Which means that doing something
like {{dataset.to_table(filter=ds.field("part") == "A")}} to filter on the
partition field with a plain string expression doesn't work, limiting the
usability of this option (and even with the new Python scalar stuff, it would
not be easy to construct the correct expression):
{code}
In [9]: part = ds.HivePartitioning.discover(max_partition_dictionary_size=2)
In [10]: dataset = ds.dataset("test_partitioned_filter/", format="parquet",
partitioning=part)
In [11]: fragment = list(dataset.get_fragments())[0]
In [12]: fragment.partition_expression
Out[12]:
<pyarrow.dataset.Expression (part == [
"A",
"B"
][0]:dictionary<values=string, indices=int32, ordered=0>)>
In [13]: dataset.to_table(filter=ds.field("part") == "A")
...
ArrowNotImplementedError: cast from string
{code}
It might be an option to keep the `partition_expression` use the dictionary
*value type* instead of dictionary type? Or alternatively, as [~fsaintjacques]
proposed, ensure that any comparison involving the dict type should also work
with the "effective" logical type (the value type of the dict).
--
This message was sent by Atlassian Jira
(v8.3.4#803005)