yjshen opened a new pull request #932:
URL: https://github.com/apache/arrow-datafusion/pull/932
# Which issue does this PR close?
Closes #.
# Rationale for this change
1. To allow for finer-grained readers that can parallelize the reading of even
a single file, or balance the workload between scanning threads even when
input file sizes vary greatly. To quote @andygrove
[here](https://docs.google.com/document/d/1ZEZqvdohrot0ewtTNeaBtqczOIJ1Q0OnX9PqMMxpOF8/edit?disco=AAAAN5XUero):
> One of the current issues IMO with DataFusion is that we use "file" as the
default unit of partitioning. We would be able to scale better if we had
finer-grained readers such as reading Parquet row groups instead. This way we
can have multiple threads reading from the same file concurrently and avoid the
need to repartition first to increase concurrency.
2. To refactor the logic in `ParquetExec` and the parquet datasource. It's
strange to call `ParquetExec::try_from_path` to get planning-related metadata.
# What changes are included in this PR?
1. `PartitionedFile` -> A single file (for the moment) or part of a file (later,
a subset of its row groups or rows). We may even extend this to include the
partition values and partition schema to support partitioned tables, e.g.
`/path/to/table/root/p_date=20210813/p_hour=1200/xxxxx.parquet`
2. `FilePartition` -> The basic unit for parallel processing. Each task is
responsible for processing one `FilePartition`, which is composed of several
`PartitionedFile`s.
3. Separating the planning-related code from `ParquetExec`.
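
As a rough illustration of how the two abstractions above relate, here is a minimal sketch. It is not the code in this PR: the `size` field, the `index` field, and the `split_files` helper are hypothetical names chosen for the example, and the greedy largest-first assignment is just one possible way to balance the workload when file sizes vary.

```rust
/// Hypothetical sketch: a single file (for now) that is scanned as one unit.
#[derive(Debug, Clone, PartialEq)]
pub struct PartitionedFile {
    pub path: String,
    /// File size in bytes (assumed field, used here only for balancing).
    pub size: u64,
}

/// The unit of parallel processing: one task handles one `FilePartition`.
#[derive(Debug, Clone, PartialEq)]
pub struct FilePartition {
    pub index: usize,
    pub files: Vec<PartitionedFile>,
}

/// Assign files to `n` partitions, largest file first, always picking the
/// partition with the smallest total size so far (hypothetical helper).
pub fn split_files(mut files: Vec<PartitionedFile>, n: usize) -> Vec<FilePartition> {
    let n = n.max(1); // always produce at least one partition
    let mut partitions: Vec<FilePartition> = (0..n)
        .map(|index| FilePartition { index, files: Vec::new() })
        .collect();
    let mut totals = vec![0u64; n];

    // Stable sort, descending by size.
    files.sort_by(|a, b| b.size.cmp(&a.size));

    for file in files {
        // `min_by_key` returns the first minimum, so ties go to the lowest index.
        let (target, _) = totals
            .iter()
            .enumerate()
            .min_by_key(|(_, total)| **total)
            .unwrap();
        totals[target] += file.size;
        partitions[target].files.push(file);
    }
    partitions
}
```

With file granularity only, one very large file still ends up alone in a partition; splitting a `PartitionedFile` into row-group ranges later would let the same assignment balance work within a single file.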
# Are there any user-facing changes?
No.