yjshen edited a comment on pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#issuecomment-1058121557
Yes, I'm aware of the parallel-fetch ability the current API exposes.
However, it is hard to express or utilize in the current execution plan: how
should I trigger parallel chunk fetches while maintaining a serialized,
single-partition read? Instead, we have the `PartitionedFile` abstraction, which
can be extended with file-slicing ability.
```rust
/// A single file that should be read, along with its schema, statistics
/// and partition column values that need to be appended to each row.
pub struct PartitionedFile {
    /// Path for the file (e.g. URL, filesystem path, etc)
    pub file_meta: FileMeta,
    /// Values of partition columns to be appended to each row
    pub partition_values: Vec<ScalarValue>,
    // We may include row group range here for a more fine-grained parallel execution
}
```
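To make the slicing idea concrete, here is a rough sketch (not the current API) of how the struct might carry an optional range; the `FileRange` type, the `range` field, and the struct name below are hypothetical, and `FileMeta` / `ScalarValue` are assumed to be in scope as in the struct above:
```rust
/// Hypothetical byte (or row-group) range inside a file, so one entry can
/// describe a slice of a larger parquet file rather than the whole file.
pub struct FileRange {
    /// Start of the slice, as a byte offset into the file
    pub start: i64,
    /// End of the slice (exclusive), as a byte offset into the file
    pub end: i64,
}

/// Sketch of an extended `PartitionedFile` carrying an optional slice.
pub struct SlicedPartitionedFile {
    pub file_meta: FileMeta,
    pub partition_values: Vec<ScalarValue>,
    /// `None` means "read the whole file"; `Some(range)` restricts the scan
    /// to the row groups that overlap this range.
    pub range: Option<FileRange>,
}
```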
For example, once parquet scan by row groups is enabled
(https://github.com/apache/arrow-rs/pull/1389), we could replace the last
comment in `PartitionedFile` above with real ranges when we want finer-grained
fetch and execution. And to control the parallelism of the FileScan execution,
we could tune a `max_byte_per_partition` configuration and partition all input
files into `Vec<Vec<PartitionedFile>>`.
Each inner `Vec<PartitionedFile>` could sum up to roughly the
`max_byte_per_partition` size, whether built from many individual parquet files
or from one big slice of a single big parquet file.
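As a rough illustration of that grouping step (not DataFusion code; `FileSlice`, `plan_partitions`, and the `(path, size)` input shape are made-up names for this sketch), a greedy packer might look like this:
```rust
/// Stand-in for a `PartitionedFile` slice: a byte range within one file.
#[derive(Debug, Clone)]
struct FileSlice {
    path: String,
    start: u64,
    end: u64,
}

/// Greedily split files into slices and pack them into partitions whose
/// total byte size stays within `max_byte_per_partition`.
fn plan_partitions(files: &[(String, u64)], max_byte_per_partition: u64) -> Vec<Vec<FileSlice>> {
    let mut partitions: Vec<Vec<FileSlice>> = Vec::new();
    let mut current: Vec<FileSlice> = Vec::new();
    let mut current_bytes = 0u64;

    for (path, size) in files {
        let mut offset = 0u64;
        while offset < *size {
            // Take as much of the remaining file as still fits in this partition.
            let len = (*size - offset).min(max_byte_per_partition - current_bytes);
            current.push(FileSlice { path: path.clone(), start: offset, end: offset + len });
            offset += len;
            current_bytes += len;
            if current_bytes == max_byte_per_partition {
                partitions.push(std::mem::take(&mut current));
                current_bytes = 0;
            }
        }
    }
    if !current.is_empty() {
        partitions.push(current);
    }
    partitions
}

fn main() {
    // Two small files share a partition; the large file is split across several.
    let files = vec![
        ("a.parquet".to_string(), 4_000_000),
        ("b.parquet".to_string(), 3_000_000),
        ("c.parquet".to_string(), 25_000_000),
    ];
    for (i, partition) in plan_partitions(&files, 10_000_000).iter().enumerate() {
        println!("partition {i}: {partition:?}");
    }
}
```
In practice the split points would fall on row-group boundaries rather than arbitrary byte offsets, but the budget-driven grouping is the same idea.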
By controlling `max_byte_per_partition`, we could still achieve the parallel
fetch of file chunks you mentioned, if users choose a smaller per-partition
input size. Or we could avoid unexpectedly reopening the same file once per
row group per column.