Christophe Clienti created ARROW-8208:
-----------------------------------------
Summary: [PYTHON] RowGroup filtering with ParquetDataset
Key: ARROW-8208
URL: https://issues.apache.org/jira/browse/ARROW-8208
Project: Apache Arrow
Issue Type: New Feature
Reporter: Christophe Clienti

Hello,

I tried to use row_group filtering at the file level with an instance of ParquetDataset, without success. I tested the workaround proposed here: [https://github.com/pandas-dev/pandas/issues/26551#issuecomment-497039883]

But I wonder whether it can work on a single file, as I get an exception with the following code:

{code:python}
ParquetDataset('data.parquet', filters=[('ticker', '=', 'AAPL')]).read().to_pandas()
{code}

{noformat}
AttributeError: 'NoneType' object has no attribute 'filter_accepts_partition'
{noformat}

I read the documentation, and filtering seems to work only on partitioned datasets. I also read the discussion in the following JIRA ticket: https://issues.apache.org/jira/browse/ARROW-1796

So I'm not sure whether a ParquetDataset can use row_group statistics to filter specific row groups within a file of a dataset.

As mentioned in ARROW-1796, I tried with fastparquet, and after fixing a bug (statistics.min instead of statistics.min_value), I was able to apply the row_group filtering.

Today, with pyarrow, I am forced to filter the row groups manually in each file, which prevents me from using the ParquetDataset partition filtering functionality.

Row groups are really useful because they avoid filling the filesystem with many small files...