[ 
https://issues.apache.org/jira/browse/ARROW-10091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-10091:
------------------------------------------
    Description: 
Currently the {{isin}} filter works for partition-based filtering, but not for 
row group (statistics)-based filtering. 

Of course, for a partition-based expression like {{name == 'a'}} it is easier 
to check this, whereas for statistics (min/max) expressions like 
{{(name >= 'a') & (name <= 'a')}} we would need to check for the special case 
of min and max being equal (I think this is the only case where we can say 
something with certainty for an "isin" expression?)

Code example:

{code:python}
>>> import numpy as np
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds
>>> import pyarrow.parquet as pq

>>> table = pa.table({"name": np.repeat(["a", "b", "c", "d"], 5), "value": np.arange(20)})
>>> pq.write_table(table, "test_filter_string.parquet", row_group_size=5)
>>> dataset = ds.dataset("test_filter_string.parquet")

# get the single file fragment (dataset consists of one file)
>>> fragment = list(dataset.get_fragments())[0]
>>> fragment.ensure_complete_metadata()

# check that we do have statistics for our row groups
>>> fragment.row_groups[0].statistics
{'name': {'min': 'a', 'max': 'a'}, 'value': {'min': 0, 'max': 4}}

# I created the file such that there are 4 row groups (each with a unique value
# in the name column)
>>> fragment.split_by_row_group()
[<pyarrow._dataset.ParquetFileFragment at 0x7ff783939810>,
 <pyarrow._dataset.ParquetFileFragment at 0x7ff783728cd8>,
 <pyarrow._dataset.ParquetFileFragment at 0x7ff78376c9c0>,
 <pyarrow._dataset.ParquetFileFragment at 0x7ff7835efd68>]

# simple equality filter works as expected -> only single row group left
>>> filter = ds.field("name") == "a"
>>> fragment.split_by_row_group(filter)
[<pyarrow._dataset.ParquetFileFragment at 0x7ff783662738>]

# isin filter does not work -> all 4 row groups are returned instead of 2
>>> filter = ds.field("name").isin(["a", "b"])
>>> fragment.split_by_row_group(filter)
[<pyarrow._dataset.ParquetFileFragment at 0x7ff7837f46f0>,
 <pyarrow._dataset.ParquetFileFragment at 0x7ff783627a98>,
 <pyarrow._dataset.ParquetFileFragment at 0x7ff783581b70>,
 <pyarrow._dataset.ParquetFileFragment at 0x7ff7835fb780>]
{code}
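
The special case described above (min and max being equal) could be sketched in plain Python as follows. This is only an illustration of the check, not the actual C++ Dataset implementation: the helper name {{row_group_may_match_isin}} and the dict-shaped statistics are assumptions made for this example, and nulls are ignored.

```python
from typing import Any, Iterable, Mapping


def row_group_may_match_isin(stats: Mapping[str, Any], values: Iterable[Any]) -> bool:
    """Hypothetical helper: return False only when the row group provably
    contains no row whose value is in `values`."""
    lo, hi = stats.get("min"), stats.get("max")
    if lo is None or hi is None:
        # No statistics available: we cannot rule the row group out.
        return True
    if lo == hi:
        # min == max: every non-null value in the row group equals `lo`,
        # so the isin check reduces to a set membership test.
        return lo in set(values)
    # min != max: this sketch stays conservative and keeps the row group.
    return True


# Statistics mirroring the four row groups of the example file
# (each row group holds a single distinct value in the "name" column).
row_group_stats = [{"min": v, "max": v} for v in ["a", "b", "c", "d"]]

kept = [
    i
    for i, stats in enumerate(row_group_stats)
    if row_group_may_match_isin(stats, ["a", "b"])
]
print(kept)  # [0, 1]
```

With such a check, the {{isin(["a", "b"])}} filter in the example would keep only the first two row groups, while any row group without usable statistics or with min != max would still be read.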


> [C++][Dataset] Support isin filter for row group (statistics-based) filtering
> -----------------------------------------------------------------------------
>
>                 Key: ARROW-10091
>                 URL: https://issues.apache.org/jira/browse/ARROW-10091
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
