[
https://issues.apache.org/jira/browse/ARROW-15724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yin updated ARROW-15724:
------------------------
Description:
Hi,
It seems that Arrow accesses all partition directories (and even each parquet
file), including those that clearly do not match the partition key values in
the filter criteria. This can cause a many-fold time difference between
accessing one partition directly vs. accessing with partition key filters,
especially on a network file system, and on a local file system when there are
lots of partitions, e.g. 1/10th of a second vs. seconds.
Attached is some Python code to create an example dataframe and save parquet
datasets with different hive partition structures (/y=/m=/d=, /y=/m=, or
/dk=), and to read the datasets with/without filters to reproduce the issue.
Observe the run time, and the directories and files accessed by the process in
Process Monitor on Windows.
In the three partition structures, I saw in Process Monitor that all
directories are accessed regardless of use_legacy_dataset=True or False.
When use_legacy_dataset=False, the parquet files in all directories were opened
and closed.
The argument validate_schema=False made a small time difference, but the
partition directories are still opened; it is also only supported when
use_legacy_dataset=True, and is not supported/passed through from the pandas
read_parquet wrapper API.
The /y=/m= layout is faster because there is no daily partition, so there are
fewer directories and files.
There is a related Stack Overflow question and example
[https://stackoverflow.com/questions/66339381/pyarrow-read-single-file-from-partitioned-parquet-dataset-is-unexpectedly-slow]
with a comment on partition discovery:
{quote}It should get discovered automatically. pd.read_parquet calls
pyarrow.parquet.read_table and the default partitioning behavior should be to
discover hive-style partitions (i.e. the ones you have). The fact that you have
to specify this means that discovery is failing. If you could create a
reproducible example and submit it to Arrow JIRA it would be helpful.
– Pace, Feb 24 2021 at 18:55
{quote}
I wonder if there is a related Jira issue here already.
I tried passing in the partitioning argument, but it didn't help.
The pyarrow versions used were 1.0.1, 5, and 7.
> reduce directory and file IO when reading partition parquet dataset
> -------------------------------------------------------------------
>
> Key: ARROW-15724
> URL: https://issues.apache.org/jira/browse/ARROW-15724
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: Yin
> Priority: Major
> Attachments: pq.py
>
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)