yjshen edited a comment on pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#issuecomment-1058121557
Yes, I'm aware of the parallel-fetch ability the current API exposes.
However, it is hard to express or utilize in the current execution plan: how
should I trigger parallel chunk fetches while maintaining a serialized,
single-partition read? Instead, we have the `PartitionedFile` abstraction, which
can be extended with file-slicing ability.
```rust
/// A single file that should be read, along with its schema, statistics
/// and partition column values that need to be appended to each row.
pub struct PartitionedFile {
    /// Path for the file (e.g. URL, filesystem path, etc)
    pub file_meta: FileMeta,
    /// Values of partition columns to be appended to each row
    pub partition_values: Vec<ScalarValue>,
    // We may include row group range here for a more fine-grained parallel execution
}
```
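To make the slicing idea concrete, here is a rough sketch (not the current API) of how the struct might carry an optional range; the `FileRange` type, the `range` field, and the struct name below are hypothetical, and `FileMeta` / `ScalarValue` are assumed to be in scope as in the struct above:
```rust
/// Hypothetical byte (or row-group) range inside a file, so one entry can
/// describe a slice of a larger parquet file rather than the whole file.
pub struct FileRange {
    /// Start of the slice, as a byte offset into the file
    pub start: i64,
    /// End of the slice (exclusive), as a byte offset into the file
    pub end: i64,
}

/// Sketch of an extended `PartitionedFile` carrying an optional slice.
pub struct SlicedPartitionedFile {
    pub file_meta: FileMeta,
    pub partition_values: Vec<ScalarValue>,
    /// `None` means "read the whole file"; `Some(range)` restricts the scan
    /// to the row groups that overlap this range.
    pub range: Option<FileRange>,
}
```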
For example, once parquet scan by row groups is enabled
(https://github.com/apache/arrow-rs/pull/1389), we could replace the last
comment in `PartitionedFile` above with real ranges when we want finer-grained
fetch and execution. And to control the parallelism of the FileScan execution,
we could tune a `max_byte_per_partition` configuration and partition all input
files into `Vec<Vec<PartitionedFile>>`.
Each inner `Vec<PartitionedFile>` could sum up to roughly the
`max_byte_per_partition` size, whether built from many individual parquet files
or from one big slice of a single big parquet file.
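As a rough illustration of that grouping step (not DataFusion code; `FileSlice`, `plan_partitions`, and the `(path, size)` input shape are made-up names for this sketch), a greedy packer might look like this:
```rust
/// Stand-in for a `PartitionedFile` slice: a byte range within one file.
#[derive(Debug, Clone)]
struct FileSlice {
    path: String,
    start: u64,
    end: u64,
}

/// Greedily split files into slices and pack them into partitions whose
/// total byte size stays within `max_byte_per_partition`.
fn plan_partitions(files: &[(String, u64)], max_byte_per_partition: u64) -> Vec<Vec<FileSlice>> {
    let mut partitions: Vec<Vec<FileSlice>> = Vec::new();
    let mut current: Vec<FileSlice> = Vec::new();
    let mut current_bytes = 0u64;

    for (path, size) in files {
        let mut offset = 0u64;
        while offset < *size {
            // Take as much of the remaining file as still fits in this partition.
            let len = (*size - offset).min(max_byte_per_partition - current_bytes);
            current.push(FileSlice { path: path.clone(), start: offset, end: offset + len });
            offset += len;
            current_bytes += len;
            if current_bytes == max_byte_per_partition {
                partitions.push(std::mem::take(&mut current));
                current_bytes = 0;
            }
        }
    }
    if !current.is_empty() {
        partitions.push(current);
    }
    partitions
}

fn main() {
    // Two small files share a partition; the large file is split across several.
    let files = vec![
        ("a.parquet".to_string(), 4_000_000),
        ("b.parquet".to_string(), 3_000_000),
        ("c.parquet".to_string(), 25_000_000),
    ];
    for (i, partition) in plan_partitions(&files, 10_000_000).iter().enumerate() {
        println!("partition {i}: {partition:?}");
    }
}
```
In practice the split points would fall on row-group boundaries rather than arbitrary byte offsets, but the budget-driven grouping is the same idea.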
By controlling `max_byte_per_partition`, we could still achieve the parallel
fetch of file chunks you mentioned, if users choose a smaller per-partition
input size. Or we could avoid unexpectedly reopening the same file once per
row group per column.