[jira] [Updated] (ARROW-9105) [C++] ParquetFileFragment::SplitByRowGroup doesn't handle filter on partition field

Joris Van den Bossche (Jira) Thu, 11 Jun 2020 07:28:00 -0700


     [ https://issues.apache.org/jira/browse/ARROW-9105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Joris Van den Bossche updated ARROW-9105:
-----------------------------------------
    Description: 
When splitting a fragment into row group fragments, filtering on the partition 
field raises an error.

Python reproducer:

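{code:python}
import pandas as pd
import pyarrow.dataset as ds

# Write a small hive-partitioned dataset (directories part=A/ and part=B/)
df = pd.DataFrame({"dummy": [1, 1, 1, 1], "part": ["A", "A", "B", "B"]})
df.to_parquet("test_partitioned_filter", partition_cols=["part"], engine="pyarrow")

# Open it as a dataset and take the first fragment (a single Parquet file)
dataset = ds.dataset("test_partitioned_filter", format="parquet", partitioning="hive")
fragment = list(dataset.get_fragments())[0]
{code}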

{code}
In [31]: dataset.to_table(filter=ds.field("part") == "A").to_pandas()
Out[31]: 
   dummy part
0      1    A
1      1    A

In [32]: fragment.split_by_row_group(ds.field("part") == "A")
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-32-371cba80fd6f> in <module>
----> 1 fragment.split_by_row_group(ds.field("part") == "A")

~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.ParquetFileFragment.split_by_row_group()

~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset._insert_implicit_casts()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Field named 'part' not found or not unique in the schema.
{code}

This is probably a "strange" thing to do, since a fragment from a partitioned 
dataset already comes from a single partition (so it can only ever satisfy a 
single equality expression on the partition field). Still, it would be nice if 
users did not have to strip the partition-field part of a filter themselves 
before passing it down to {{split_by_row_group}}.
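
Until this is handled inside Arrow, a user-side workaround might look roughly like the sketch below. It is only an illustration in plain Python, not Arrow's implementation and not part of the pyarrow API: the helpers {{hive_partition_values}} and {{split_filter}}, the representation of a filter as a list of (field, op, value) terms combined with AND, and the file name {{data.parquet}} are all assumptions made for the example. The idea is to evaluate the partition-field terms against the values encoded in the fragment's hive-style path, and to keep only the remaining terms for the row-group filter.

```python
from pathlib import PurePosixPath


def hive_partition_values(path):
    """Parse hive-style ``key=value`` directory segments from a file path.

    Hypothetical helper: values are returned as strings, exactly as they
    appear in the path.
    """
    values = {}
    # Skip the last component, which is the file name itself.
    for segment in PurePosixPath(path).parts[:-1]:
        key, sep, value = segment.partition("=")
        if sep:  # only segments that actually contain "=" are partition keys
            values[key] = value
    return values


def split_filter(conjuncts, partition_values):
    """Split an AND-ed list of (field, op, value) terms for one fragment.

    Returns ``(matches, remaining)``: ``matches`` is False when an equality
    on a partition field contradicts the fragment's partition value, and
    ``remaining`` holds the terms that refer to columns stored in the file.
    """
    matches = True
    remaining = []
    for field, op, value in conjuncts:
        if field in partition_values:
            if op != "==":
                raise NotImplementedError("sketch only handles equality")
            matches = matches and partition_values[field] == value
        else:
            remaining.append((field, op, value))
    return matches, remaining


# Example with a path shaped like the reproducer's dataset
# (the file name "data.parquet" is made up).
values = hive_partition_values("test_partitioned_filter/part=A/data.parquet")
matches, remaining = split_filter([("part", "==", "A"), ("dummy", "==", 1)], values)
```

If {{matches}} is false the fragment can be skipped entirely; otherwise only the {{remaining}} terms would need to be turned into an expression for {{split_by_row_group}}.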


> [C++] ParquetFileFragment::SplitByRowGroup doesn't handle filter on partition field
> -----------------------------------------------------------------------------------
>
>                 Key: ARROW-9105
>                 URL: https://issues.apache.org/jira/browse/ARROW-9105
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset, dataset-dask-integration
>             Fix For: 1.0.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
