jorisvandenbossche opened a new issue, #39064: URL: https://github.com/apache/arrow/issues/39064
Currently the filtering of row groups based on a predicate only supports non-nested paths. When getting the statistics, this only works for a leaf node: https://github.com/apache/arrow/blob/f7947cc21bf78d67cf5ac1bf1894b5e04de1a632/cpp/src/arrow/dataset/file_parquet.cc#L160-L170 but we are calling this ColumnChunkStatisticsAsExpression function with the struct parent, and not with the struct field leaf. The `schema_field` passed to the function above is created with `match[0]`, i.e. only the first part of the matching field path: https://github.com/apache/arrow/blob/f7947cc21bf78d67cf5ac1bf1894b5e04de1a632/cpp/src/arrow/dataset/file_parquet.cc#L903 --- To illustrate this, creating a small test file with a nested struct column and consisting of two row groups: ```python import pyarrow as pa import pyarrow.parquet as pq struct_arr = pa.StructArray.from_arrays([[1, 2, 3, 4]]*4, names=["xmin", "xmax", "ymin", "ymax"]) table = pa.table({"geom": [1, 2, 3, 4], "bbox": struct_arr}) pq.write_table(table, "test_bbox_struct.parquet", row_group_size=2) ``` Reading this through the Datasets API with a filter _seems_ to filter this correctly: ```python import pyarrow.dataset as ds dataset = ds.dataset("test_bbox_struct.parquet", format="parquet") dataset.to_table(filter=ds.field("bbox", "xmax") <=2).to_pandas() # geom bbox # 0 1 {'xmin': 1, 'xmax': 1, 'ymin': 1, 'ymax': 1} # 1 2 {'xmin': 2, 'xmax': 2, 'ymin': 2, 'ymax': 2} ``` However, that is only because we correctly filter this with a nested field ref in the second step, i.e. doing an actual filter operation after reading the data. But if we look at APIs that just does the row group filtering step, we can see this is currently not being filtered at the row group stage: ```python In [2]: fragment = list(dataset.get_fragments())[0] In [3]: fragment.split_by_row_group() Out[3]: [<pyarrow.dataset.ParquetFileFragment path=test_bbox_struct.parquet>, <pyarrow.dataset.ParquetFileFragment path=test_bbox_struct.parquet>] In [4]: fragment.split_by_row_group(filter=ds.field("bbox", "xmax") <=2) Out[4]: [<pyarrow.dataset.ParquetFileFragment path=test_bbox_struct.parquet>, <pyarrow.dataset.ParquetFileFragment path=test_bbox_struct.parquet>] ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
