yjshen edited a comment on pull request #1905:
URL: https://github.com/apache/arrow-datafusion/pull/1905#issuecomment-1058121557


   Yes, I'm aware of the parallelization ability the current API exposes; however, it's hard to express or fully utilize in the current execution plan: how should I trigger parallel chunk fetches while maintaining a serialized read within each single partition? Instead, we have the `PartitionedFile` abstraction, which can be extended with file slicing ability.
   
   ```rust
   /// A single file that should be read, along with its schema, statistics
   /// and partition column values that need to be appended to each row.
   pub struct PartitionedFile {
       /// Path for the file (e.g. URL, filesystem path, etc)
       pub file_meta: FileMeta,
       /// Values of partition columns to be appended to each row
       pub partition_values: Vec<ScalarValue>,
       // We may include row group range here for a more fine-grained parallel execution
   }
   ```
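   
   Concretely, that trailing comment could become an optional range field. Here is a minimal sketch of what such an extension might look like (the `FileRange` type, its fields, and the `range` field name are illustrative, not part of this PR):
   
   ```rust
   /// Illustrative only: a slice of a file, e.g. the byte range covering
   /// a subset of a parquet file's row groups.
   pub struct FileRange {
       /// Byte offset where the slice starts
       pub start: i64,
       /// Byte offset where the slice ends (exclusive)
       pub end: i64,
   }
   
   pub struct PartitionedFile {
       pub file_meta: FileMeta,
       pub partition_values: Vec<ScalarValue>,
       /// When `Some`, read only the row groups falling in this range, so
       /// several `PartitionedFile`s can cover one large file in parallel
       pub range: Option<FileRange>,
   }
   ```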
   
   For example, with the row-group-level parquet scan ability from https://github.com/apache/arrow-rs/pull/1389, we could replace that trailing comment with real ranges (as sketched above) whenever we want finer-grained fetch and execution. And to control the parallelism of FileScan execution, we could simply tune a 'max_byte_per_split' configuration and partition all input files into a `Vec<Vec<PartitionedFile>>`, where each inner `Vec<PartitionedFile>` adds up to roughly the 'max_byte_per_split' size, whether it consists of many small parquet files or of one big slice of a single large parquet file. A rough sketch of that split planning follows.
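   
   Under the assumption that each `PartitionedFile` can report the byte size of the file (or of its sliced range), the split planning could be a simple greedy packing. In the sketch below, `file_size` is a stand-in for however that size would be obtained from `FileMeta`, and `split_files` is a hypothetical name, not an existing API:
   
   ```rust
   /// Greedily pack files (or file slices) into splits of at most
   /// `max_byte_per_split` bytes; each inner Vec becomes one partition.
   fn split_files(
       files: Vec<PartitionedFile>,
       max_byte_per_split: u64,
   ) -> Vec<Vec<PartitionedFile>> {
       let mut splits: Vec<Vec<PartitionedFile>> = Vec::new();
       let mut current: Vec<PartitionedFile> = Vec::new();
       let mut current_bytes = 0u64;
       for file in files {
           // Stand-in: size of the whole file, or of its `range` when set
           let size = file_size(&file);
           // Close the current split once adding this file would exceed the cap
           if !current.is_empty() && current_bytes + size > max_byte_per_split {
               splits.push(std::mem::take(&mut current));
               current_bytes = 0;
           }
           current_bytes += size;
           current.push(file);
       }
       if !current.is_empty() {
           splits.push(current);
       }
       splits
   }
   ```
   
   A file bigger than 'max_byte_per_split' would first be sliced into multiple `PartitionedFile`s with ranges, so that no single slice exceeds the cap.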

