jorisvandenbossche commented on issue #34162:
URL: https://github.com/apache/arrow/issues/34162#issuecomment-1433256003
@Fokko I am trying to reproduce this with just pyarrow, but so far without success.
First, just checking that plain filtering on in-memory data works (which it does):
```
>>> import numpy as np
>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> table = pa.table({"idx": [1, 2, 3], "col_numeric": [np.nan, None, 1]})
>>> table
pyarrow.Table
idx: int64
col_numeric: double
----
idx: [[1,2,3]]
col_numeric: [[nan,null,1]]
>>> table.filter(pc.field('col_numeric').is_null(nan_is_null=True) & ~pc.field('col_numeric').is_null())
pyarrow.Table
idx: int64
col_numeric: double
----
idx: [[1]]
col_numeric: [[nan]]
```
Then, I write this table to Parquet, making sure that each row ends up in its own row group (still in a single file), so that reading it back with a filter should use the row group statistics for pruning as a first step. This also seems to work:
```
>>> import pyarrow.parquet as pq
>>> pq.write_table(table, "test_filter_nan.parquet", row_group_size=1)
>>> meta = pq.read_metadata("test_filter_nan.parquet")
>>> meta.num_row_groups
3
>>> pq.read_table("test_filter_nan.parquet",
filters=pc.field('col_numeric').is_null(nan_is_null=True) &
~pc.field('col_numeric').is_null())
pyarrow.Table
idx: int64
col_numeric: double
----
idx: [[1]]
col_numeric: [[nan]]
```
Now, maybe this depends on how the Parquet file was written. When written with pyarrow as above, the row groups containing the NaN and null values don't have statistics set, so the predicate pushdown row group filtering can never skip them based on those statistics.
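For completeness, one way to check this is to inspect the per-row-group column chunk metadata of the file written above. A minimal sketch (assuming the `test_filter_nan.parquet` file from the snippet above, where `col_numeric` is the second column in the schema):
```
import pyarrow.parquet as pq

meta = pq.read_metadata("test_filter_nan.parquet")
for i in range(meta.num_row_groups):
    # column index 1 corresponds to "col_numeric" in the schema above
    col = meta.row_group(i).column(1)
    print(f"row group {i}: is_stats_set={col.is_stats_set}, statistics={col.statistics}")
```
For a file written as above, I would expect the row groups holding the NaN and null values to show `is_stats_set=False`, matching the description above.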
@Fokko Your files were created with Spark, I assume? Would it be possible to
share those 3 small parquet files from your example above?