[I] [C++][Parquet] Support row group filtering for nested paths [arrow]

via GitHub Mon, 04 Dec 2023 08:32:21 -0800


jorisvandenbossche opened a new issue, #39064:
URL: https://github.com/apache/arrow/issues/39064


   Currently the filtering of row groups based on a predicate only supports 
non-nested paths. When getting the statistics, this only works for a leaf node:
   
   
https://github.com/apache/arrow/blob/f7947cc21bf78d67cf5ac1bf1894b5e04de1a632/cpp/src/arrow/dataset/file_parquet.cc#L160-L170
   
   but we are calling this `ColumnChunkStatisticsAsExpression` function with 
the struct parent, not with the struct field leaf. The `schema_field` passed to 
the function above is created with `match[0]`, i.e. only the first part of the 
matching field path:
   
   
https://github.com/apache/arrow/blob/f7947cc21bf78d67cf5ac1bf1894b5e04de1a632/cpp/src/arrow/dataset/file_parquet.cc#L903
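
To make the effect of using only `match[0]` concrete, here is a toy model in plain Python. It is not the Arrow implementation: the `schema` layout and the `resolve` helper are hypothetical, and only illustrate that the first index of a field path lands on the struct parent, while statistics exist only for leaves.

```python
# Toy schema: (name, children). children is None for a leaf, which is the
# only kind of node that maps to a Parquet column chunk with statistics.
schema = [
    ("geom", None),
    ("bbox", [("xmin", None), ("xmax", None), ("ymin", None), ("ymax", None)]),
]


def resolve(fields, path):
    """Hypothetical helper: walk an index path, return (dotted name, is_leaf)."""
    names = []
    children = fields
    for index in path:
        name, children = children[index]
        names.append(name)
    return ".".join(names), children is None


match = [1, 1]  # field path for bbox -> xmax

# Using only the first index stops at the struct parent (not a leaf).
first_only = resolve(schema, match[:1])
# Walking the full path reaches the leaf that carries statistics.
full = resolve(schema, match)

print(first_only)
print(full)
```

In this model, supporting nested paths amounts to resolving the whole path rather than its first element.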
   
   ---
   
   To illustrate this, create a small test file with a nested struct column 
that consists of two row groups:
   
   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   struct_arr = pa.StructArray.from_arrays(
       [[1, 2, 3, 4]] * 4, names=["xmin", "xmax", "ymin", "ymax"]
   )
   table = pa.table({"geom": [1, 2, 3, 4], "bbox": struct_arr})
   
   pq.write_table(table, "test_bbox_struct.parquet", row_group_size=2)
   ```
   
   Reading this through the Datasets API with a filter _seems_ to filter this 
correctly:
   
   ```python
   import pyarrow.dataset as ds
   dataset = ds.dataset("test_bbox_struct.parquet", format="parquet")
   
   dataset.to_table(filter=ds.field("bbox", "xmax") <=2).to_pandas()
   #    geom                                          bbox
   # 0     1  {'xmin': 1, 'xmax': 1, 'ymin': 1, 'ymax': 1}
   # 1     2  {'xmin': 2, 'xmax': 2, 'ymin': 2, 'ymax': 2}
   ```
   
   However, that is only because we correctly filter with a nested field ref 
in the second step, i.e. by doing an actual filter operation after reading the 
data. If we look at an API that only does the row group filtering step, we can 
see that nothing is filtered out at the row group stage:
   
   ```python
   In [2]: fragment = list(dataset.get_fragments())[0]
   
   In [3]: fragment.split_by_row_group()
   Out[3]: 
   [<pyarrow.dataset.ParquetFileFragment path=test_bbox_struct.parquet>,
    <pyarrow.dataset.ParquetFileFragment path=test_bbox_struct.parquet>]
   
   In [4]: fragment.split_by_row_group(filter=ds.field("bbox", "xmax") <=2)
   Out[4]: 
   [<pyarrow.dataset.ParquetFileFragment path=test_bbox_struct.parquet>,
    <pyarrow.dataset.ParquetFileFragment path=test_bbox_struct.parquet>]
   ```


