Yin created ARROW-15724:
---------------------------
Summary: reduce directory and file IO when reading partition
parquet dataset
Key: ARROW-15724
URL: https://issues.apache.org/jira/browse/ARROW-15724
Project: Apache Arrow
Issue Type: Improvement
Reporter: Yin
Attachments: pq.py
Hi,
It seems Arrow accesses all partitions directories (even the parquet files),
including those clearly not matching with the partition key values in the
filters. This may cause multiple times difference between accessing one
partition directly vs accessing with partition key filters,
specially on Network file system, and on local file system when there are lots
of partitions, e.g. 1/10th of second vs seconds.
Attached Python code to create example dataframe and save parquet datasets with
different hive partition structure (/y=/m=/d=, or /y=/m=, or /dk=). And read
the datasets with/without filters to reproduce the issue. Observe the run time,
and the directories and files which are accessed by the process in Process
Monitor on Windows.
In the three partition structures, I saw in Process Monitor that all
directories are accessed regardless of use_legacy_dataset=True or False.
When use_legacy_dataset=False, the parquet files in all directories were
opened.
The argument validate_schema=False made small time difference, but still opens
the partition directories, and it's only supported when
use_legacy_dataset=True, and not supported/passed in from pandas read_parquet
wrapper API.
The /y=/m= is faster since there is no daily partition so less directories and
files.
There was a related another stackoverflow question and example
[https://stackoverflow.com/questions/66339381/pyarrow-read-single-file-from-partitioned-parquet-dataset-is-unexpectedly-slow]
and there was a comment on the partition discovery:
{quote}It should get discovered automatically. pd.read_parquet calls
pyarrow.parquet.read_table and the default partitioning behavior should be to
discover hive-style partitions (i.e. the ones you have). The fact that you have
to specify this means that discovery is failing. If you could create a
reproducible example and submit it to Arrow JIRA it would be helpful.
– Pace Feb 24 2021 at 18:55"
{quote}
Wonder if there was some related Jira here already.
I tried passing in partitioning argument, it didn't help.
The version of pyarrow used were 1.01, 5, and 7.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)