Blaž Zupančič created ARROW-16982:
-------------------------------------

             Summary: Slow reading of partitioned parquet files from S3
                 Key: ARROW-16982
                 URL: https://issues.apache.org/jira/browse/ARROW-16982
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Parquet, Python
    Affects Versions: 8.0.0
            Reporter: Blaž Zupančič


When reading partitioned Parquet files from S3 and using filters to select 
partitions, the reader issues S3 LIST requests every time read_table() is called.
{code:python}
# partitioning: s3://bucket/year=xxxx/month=y/day=z

from pyarrow import parquet
parquet.read_table('s3://bucket', filters=[('day', '=', 1)]) # lists s3 bucket
parquet.read_table('s3://bucket', filters=[('day', '=', 2)]) # lists again
{code}
This is not a problem if done once, but repeated calls that select different 
partitions lead to a large number of (slow and potentially expensive) S3 LIST 
requests.

The current workaround is to list and filter the partition structure manually; 
however, this is nowhere near as convenient as using filters.
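
A rough sketch of that manual workaround (the bucket layout matches the example 
above; the exact paths and partition values are just placeholders):
{code:python}
from pyarrow import fs, parquet

s3 = fs.S3FileSystem()

# One recursive LIST of the whole partition tree, done up front
infos = s3.get_file_info(fs.FileSelector('bucket', recursive=True))

# Group the data files by their 'day' partition value
files_by_day = {}
for info in infos:
    if info.type == fs.FileType.File and '/day=' in info.path:
        day = info.path.split('/day=')[1].split('/')[0]
        files_by_day.setdefault(day, []).append(info.path)

# Later reads hit the cached paths directly (GET requests only)
table = parquet.read_table(files_by_day['1'], filesystem=s3)
{code}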

If we know that the S3 prefixes have not changed, it should be possible to do 
the recursive list only once and then load different data multiple times (using 
only S3 GET requests). I suppose this should be possible with ParquetDataset; 
however, the current implementation only accepts filters in the constructor, 
not in the read() method.
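
For illustration, something like this with the pyarrow.dataset API seems close 
to what I have in mind (a sketch; discovery should happen once at construction, 
so repeated filtered reads would only need GET requests):
{code:python}
import pyarrow.dataset as ds

# Recursive LIST happens once, when the dataset is constructed
dataset = ds.dataset('s3://bucket', format='parquet', partitioning='hive')

# Each read prunes to the matching partitions and only issues GETs
table_day1 = dataset.to_table(filter=ds.field('day') == 1)
table_day2 = dataset.to_table(filter=ds.field('day') == 2)
{code}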



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
