jorisvandenbossche commented on issue #34162:
URL: https://github.com/apache/arrow/issues/34162#issuecomment-1433256003

   @Fokko I am trying to reproduce this with just pyarrow, but so far without 
success.
   
   First, just checking that plain filtering on in-memory data works (which it 
does):
   
   ```
   >>> import numpy as np
   >>> import pyarrow as pa
   >>> import pyarrow.compute as pc
   >>> table = pa.table({"idx": [1, 2, 3], "col_numeric": [np.nan, None, 1]})
   >>> table
   pyarrow.Table
   idx: int64
   col_numeric: double
   ----
   idx: [[1,2,3]]
   col_numeric: [[nan,null,1]]
   
   >>> table.filter(pc.field('col_numeric').is_null(nan_is_null=True) & ~pc.field('col_numeric').is_null())
   pyarrow.Table
   idx: int64
   col_numeric: double
   ----
   idx: [[1]]
   col_numeric: [[nan]]
   ```
   
   Then, if I write this table to Parquet and ensure that each row ends up in its 
own row group (still in a single file), reading it back with a filter should use 
the row group statistics for pruning as a first step. This also seems to work:
   
   ```
   >>> import pyarrow.parquet as pq
   >>> pq.write_table(table, "test_filter_nan.parquet", row_group_size=1)
   >>> meta = pq.read_metadata("test_filter_nan.parquet")
   >>> meta.num_row_groups
   3
   >>> pq.read_table("test_filter_nan.parquet", filters=pc.field('col_numeric').is_null(nan_is_null=True) & ~pc.field('col_numeric').is_null())
   pyarrow.Table
   idx: int64
   col_numeric: double
   ----
   idx: [[1]]
   col_numeric: [[nan]]
   ```
   
   Now, maybe this depends on how the Parquet file was written. When written 
with pyarrow as above, the row groups containing the NaN and null values don't 
have min/max statistics set (and so predicate pushdown row group filtering can 
never decide to skip them based on statistics).  
   @Fokko Your files were created with Spark, I assume? Would it be possible to 
share those 3 small parquet files from your example above?
   
   
   

