[
https://issues.apache.org/jira/browse/ARROW-10995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andy Grove resolved ARROW-10995.
--------------------------------
Resolution: Fixed
Issue resolved by pull request 9029
[https://github.com/apache/arrow/pull/9029]
> [Rust] [DataFusion] Improve parallelism when reading Parquet files
> ------------------------------------------------------------------
>
> Key: ARROW-10995
> URL: https://issues.apache.org/jira/browse/ARROW-10995
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Rust - DataFusion
> Reporter: Andy Grove
> Assignee: Andy Grove
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.0.0
>
> Time Spent: 2h 10m
> Remaining Estimate: 0h
>
> Currently the unit of parallelism is the Parquet file: the number of
> concurrent tasks equals the number of files being read. For example, if we
> run a query against a Parquet table that consists of 8 partitions (files),
> we attempt to run 8 async tasks in parallel, whereas for a single Parquet
> file we run only 1 async task, so this does not scale well. Conversely, if
> there are hundreds or thousands of Parquet files, we try to process them all
> concurrently, which also doesn't scale well.
> These are the options for improving this situation:
>
> # Use Parquet row groups as the unit of partitioning and divide the row
> groups evenly across the desired level of concurrency (defaulting to the
> number of cores).
> # Keep the file as the unit of partitioning. If there are fewer partitions
> (files) than cores, add a RepartitionExec to the plan. If there are more
> files than cores, split the files into lists so that each partition is a
> list of files rather than a single file. Each partition task then processes
> one file at a time.
>
>
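Both options above come down to spreading N work items (row groups in option 1, files in option 2) across a bounded number of partitions. A minimal sketch of that step follows. The names `split_into_partitions` and `default_concurrency` are hypothetical and are not part of the DataFusion API, and using `std::thread::available_parallelism` for "number of cores" is an assumption made for illustration. Pull request 9029 may implement this differently.

```rust
use std::thread;

/// Hypothetical helper (not a DataFusion API): the default level of
/// concurrency, assumed here to be the parallelism reported by the
/// standard library, falling back to 1 if it cannot be determined.
fn default_concurrency() -> usize {
    thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
}

/// Hypothetical helper (not a DataFusion API): distribute `items` (file
/// paths or row-group indices) across at most `target_partitions`
/// partitions, keeping the original order and balancing the sizes so
/// that they differ by at most one.
fn split_into_partitions<T: Clone>(items: &[T], target_partitions: usize) -> Vec<Vec<T>> {
    if items.is_empty() || target_partitions == 0 {
        return Vec::new();
    }
    // Never create more partitions than there are items.
    let n = target_partitions.min(items.len());
    let base = items.len() / n;
    let extra = items.len() % n;

    let mut partitions = Vec::with_capacity(n);
    let mut start = 0;
    for i in 0..n {
        // The first `extra` partitions take one additional item.
        let len = base + if i < extra { 1 } else { 0 };
        partitions.push(items[start..start + len].to_vec());
        start += len;
    }
    partitions
}
```

Under these assumptions, `split_into_partitions(&files, default_concurrency())` yields one list of files per partition task, which that task can then read one file at a time. When there are fewer items than cores, the result has only one partition per item, which is the case where option 2 relies on a RepartitionExec to restore parallelism.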
--
This message was sent by Atlassian Jira
(v8.3.4#803005)