[ 
https://issues.apache.org/jira/browse/ARROW-9345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-9345:
-----------------------------------------
    Description: 
Related to ARROW-8647, see comment at 
https://github.com/apache/arrow/pull/7536#issuecomment-653124260

When using dictionary type for the partition fields, this now creates partition 
expressions that also use a dictionary type. Which means that doing something 
like {{dataset.to_table(filter=ds.field("part") == "A")}} to filter on the 
partition field with a plain string expression doesn't work, limiting the 
usability of this option (and even with the new Python scalar stuff, it would 
not be easy to construct the correct expression):

{code}
In [9]: part = ds.HivePartitioning.discover(max_partition_dictionary_size=2)  

In [10]: dataset = ds.dataset("test_partitioned_filter/", format="parquet", 
partitioning=part)

In [11]: fragment = list(dataset.get_fragments())[0]   

In [12]: fragment.partition_expression  
Out[12]: 
<pyarrow.dataset.Expression (part == [
  "A",
  "B"
][0]:dictionary<values=string, indices=int32, ordered=0>)>

In [13]: dataset.to_table(filter=ds.field("part") == "A") 
...
ArrowNotImplementedError: cast from string
{code}

It might be an option to keep the {{partition_expression}} use the dictionary 
_value type_ instead of dictionary type? Or alternatively, as [~fsaintjacques] 
proposed, ensure that any comparison involving the dict type should also work 
with the "effective" logical type (the value type of the dict).

  was:
Related to ARROW-8647, see comment at 
https://github.com/apache/arrow/pull/7536#issuecomment-653124260

When using dictionary type for the partition fields, this now creates partition 
expressions that also use a dictionary type. Which means that doing something 
like {{dataset.to_table(filter=ds.field("part") == "A")}} to filter on the 
partition field with a plain string expression doesn't work, limiting the 
usability of this option (and even with the new Python scalar stuff, it would 
not be easy to construct the correct expression):

{code}
In [9]: part = ds.HivePartitioning.discover(max_partition_dictionary_size=2)  

In [10]: dataset = ds.dataset("test_partitioned_filter/", format="parquet", 
partitioning=part)

In [11]: fragment = list(dataset.get_fragments())[0]   

In [12]: fragment.partition_expression  
Out[12]: 
<pyarrow.dataset.Expression (part == [
  "A",
  "B"
][0]:dictionary<values=string, indices=int32, ordered=0>)>

In [13]: dataset.to_table(filter=ds.field("part") == "A") 
...
ArrowNotImplementedError: cast from string
{code}

It might be an option to keep the `partition_expression` use the dictionary 
*value type* instead of dictionary type? Or alternatively, as [~fsaintjacques] 
proposed, ensure that any comparison involving the dict type should also work 
with the "effective" logical type (the value type of the dict).


> [C++][Dataset] Expression with dictionary type should work with operand of 
> value type 
> --------------------------------------------------------------------------------------
>
>                 Key: ARROW-9345
>                 URL: https://issues.apache.org/jira/browse/ARROW-9345
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset
>
> Related to ARROW-8647, see comment at 
> https://github.com/apache/arrow/pull/7536#issuecomment-653124260
> When using dictionary type for the partition fields, this now creates 
> partition expressions that also use a dictionary type. Which means that doing 
> something like {{dataset.to_table(filter=ds.field("part") == "A")}} to filter 
> on the partition field with a plain string expression doesn't work, limiting 
> the usability of this option (and even with the new Python scalar stuff, it 
> would not be easy to construct the correct expression):
> {code}
> In [9]: part = ds.HivePartitioning.discover(max_partition_dictionary_size=2)  
> In [10]: dataset = ds.dataset("test_partitioned_filter/", format="parquet", 
> partitioning=part)
> In [11]: fragment = list(dataset.get_fragments())[0]   
> In [12]: fragment.partition_expression  
> Out[12]: 
> <pyarrow.dataset.Expression (part == [
>   "A",
>   "B"
> ][0]:dictionary<values=string, indices=int32, ordered=0>)>
> In [13]: dataset.to_table(filter=ds.field("part") == "A") 
> ...
> ArrowNotImplementedError: cast from string
> {code}
> It might be an option to keep the {{partition_expression}} use the dictionary 
> _value type_ instead of dictionary type? Or alternatively, as 
> [~fsaintjacques] proposed, ensure that any comparison involving the dict type 
> should also work with the "effective" logical type (the value type of the 
> dict).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to