[jira] [Created] (ARROW-8208) [PYTHON] RowGroup filtering with ParquetDataset

Christophe Clienti (Jira) Wed, 25 Mar 2020 03:50:32 -0700

Christophe Clienti created ARROW-8208:
-----------------------------------------


             Summary: [PYTHON] RowGroup filtering with ParquetDataset
                 Key: ARROW-8208
                 URL: https://issues.apache.org/jira/browse/ARROW-8208
             Project: Apache Arrow
          Issue Type: New Feature
            Reporter: Christophe Clienti


Hello,

I tried to use row group filtering at the file level with an instance of 
ParquetDataset, without success.

I've tested the workaround proposed here:
 [https://github.com/pandas-dev/pandas/issues/26551#issuecomment-497039883]

However, I wonder whether it can work on a single file, since I get an 
exception with the following code:
{code:python}
from pyarrow.parquet import ParquetDataset

ParquetDataset('data.parquet',
               filters=[('ticker', '=', 'AAPL')]).read().to_pandas()
{code}
{noformat}
AttributeError: 'NoneType' object has no attribute 'filter_accepts_partition'
{noformat}
I read the documentation, and the filtering seems to work only on partitioned 
datasets. I also found some related information in the following JIRA ticket:
 https://issues.apache.org/jira/browse/ARROW-1796

So I'm not sure whether a ParquetDataset can use row group statistics to 
select specific row groups within a file of a dataset.

As mentioned in ARROW-1796, I tried with fastparquet, and after fixing a bug 
(statistics.min instead of statistics.min_value), I was able to apply the 
row_group filtering.

Today, with pyarrow, I have to filter the row groups in each file manually, 
which prevents me from using the ParquetDataset partition filtering 
functionality.

Row groups are really useful because they avoid filling the filesystem with 
many small files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
