Joris Van den Bossche created ARROW-9105:
--------------------------------------------

             Summary: [C++] ParquetFileFragment::SplitByRowGroup doesn't handle filter on partition field
                 Key: ARROW-9105
                 URL: https://issues.apache.org/jira/browse/ARROW-9105
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
            Reporter: Joris Van den Bossche
             Fix For: 1.0.0


When splitting a fragment into row group fragments, filtering on the partition field raises an error.

Python reproducer:

```
import pandas as pd
import pyarrow.dataset as ds

# Write a small dataset that is hive-partitioned on the "part" column
df = pd.DataFrame({"dummy": [1, 1, 1, 1], "part": ["A", "A", "B", "B"]})
df.to_parquet("test_partitioned_filter", partition_cols=["part"], engine="pyarrow")

# Read it back as a dataset and take the first fragment (a single partition)
dataset = ds.dataset("test_partitioned_filter", format="parquet", partitioning="hive")
fragment = list(dataset.get_fragments())[0]
```

```
In [31]: dataset.to_table(filter=ds.field("part") == "A").to_pandas()
Out[31]:
   dummy part
0      1    A
1      1    A

In [32]: fragment.split_by_row_group(ds.field("part") == "A")
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-32-371cba80fd6f> in <module>
----> 1 fragment.split_by_row_group(ds.field("part") == "A")

~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.ParquetFileFragment.split_by_row_group()

~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset._insert_implicit_casts()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Field named 'part' not found or not unique in the schema.
```

This is probably a "strange" thing to do, since a fragment of a partitioned dataset already comes from a single partition (so it will only ever satisfy a single equality expression on the partition field). Still, it would be nice if users did not have to take care to pass only the non-partition part of the filter down to {{split_by_row_group}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
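
To illustrate what "only passing part of the filter down" would mean for a user until this is handled in C++, here is a minimal sketch. It assumes the filter is a simple conjunction written as (column, op, value) tuples, in the style of the `filters` argument of `pyarrow.parquet`. The helper `split_predicates` is hypothetical and not part of pyarrow.

```python
# Hypothetical helper, not part of pyarrow: separates the predicates of a
# conjunctive filter into those on partition fields and all the others.
def split_predicates(predicates, partition_fields):
    """Split (column, op, value) tuples by whether column is a partition field."""
    on_partition = [p for p in predicates if p[0] in partition_fields]
    remaining = [p for p in predicates if p[0] not in partition_fields]
    return on_partition, remaining


# Filter on both the partition field ("part") and a regular column ("dummy")
predicates = [("part", "==", "A"), ("dummy", "==", 1)]
on_partition, remaining = split_predicates(predicates, {"part"})

# on_partition holds the predicates that select whole fragments;
# remaining holds the ones that refer to columns stored in the Parquet file.
print(on_partition)
print(remaining)
```

Only the predicates in `remaining` refer to columns that exist in the file's own schema, so only those would be turned into an expression and passed to `split_by_row_group`; the ones in `on_partition` would be used to select fragments instead.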