alamb commented on issue #20529:
URL: https://github.com/apache/datafusion/issues/20529#issuecomment-4023988876

   Here are some thoughts (it is getting hard to keep track of what is going on 
on the PR https://github.com/apache/datafusion/pull/20481). I have been looking 
at that PR to figure out how we can most smoothly fit its code and ideas into 
the existing DataFusion code.
   
   Here is my summary of the architectural changes in 
https://github.com/apache/datafusion/pull/20481:
   * The 
[`FileStream`](https://github.com/alamb/datafusion/blob/df6c035e68e9508029c9ba5b0979dad428573e63/datafusion/datasource/src/file_stream.rs#L60-L59)
 is updated to know about "Morsels"
   * A new API, "morselize", is added to the [`FileOpener` 
API](https://github.com/alamb/datafusion/blob/df6c035e68e9508029c9ba5b0979dad428573e63/datafusion/datasource/src/file_stream.rs#L588-L587) 
in parallel with the existing one. It takes a single `PartitionedFile` and 
breaks it into Morsels (which are smaller `PartitionedFile`s)
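
To make the "morselize" step concrete, here is a minimal sketch that splits one file into fixed-size byte ranges. `FileSlice` is a hypothetical, simplified stand-in for `PartitionedFile`, and the free function `morselize` with its `target_bytes` parameter is an assumption for illustration, not the signature used in the PR (which may split on row groups rather than raw byte offsets).

```rust
/// Hypothetical, simplified stand-in for DataFusion's `PartitionedFile`.
#[derive(Debug, Clone, PartialEq)]
pub struct FileSlice {
    pub path: String,
    /// Half-open byte range `[start, end)` covered by this slice.
    pub start: u64,
    pub end: u64,
}

/// Split one file slice into morsels of at most `target_bytes` bytes each.
/// Fixed-size splitting is an assumption made for this sketch.
pub fn morselize(file: &FileSlice, target_bytes: u64) -> Vec<FileSlice> {
    assert!(target_bytes > 0, "target_bytes must be non-zero");
    let mut morsels = Vec::new();
    let mut start = file.start;
    while start < file.end {
        // Clamp the last morsel to the end of the original slice.
        let end = file.end.min(start.saturating_add(target_bytes));
        morsels.push(FileSlice {
            path: file.path.clone(),
            start,
            end,
        });
        start = end;
    }
    morsels
}
```

Each morsel has the same shape as its input, which mirrors the idea that a morsel is just a smaller `PartitionedFile`.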
   
   Challenges I see with this design (all can be overcome with some more code):
   1. Because `FileOpener` gains a parallel API with its own parallel code 
path, this may be hard to test
   2. It may also be hard to apply the morsel idea to non-Parquet paths (even 
though the idea is absolutely applicable). However, that might also be OK
   3. It uses the "extensions" field to [stash the 
morsels](https://github.com/alamb/datafusion/blob/df6c035e68e9508029c9ba5b0979dad428573e63/datafusion/datasource-parquet/src/opener.rs#L253-L252),
 which I think will break some downstream users who use that field for other 
things (like trace context)
   4. It is not clear to me how we will add other features like configurable IO 
prefetching without hard-coding more into `FileOpener`
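
The concern in point 3 can be sketched as follows, assuming the "extensions" field is a single type-erased slot. `FileWithExtensions`, `TraceContext`, and `StashedMorsels` are hypothetical names for illustration only; the point is that a second writer to the slot silently displaces whatever a downstream user stored there.

```rust
use std::any::Any;
use std::sync::Arc;

/// Hypothetical stand-in for a file descriptor with one type-erased
/// extension slot (assumed shape, not copied from DataFusion).
pub struct FileWithExtensions {
    pub path: String,
    pub extensions: Option<Arc<dyn Any + Send + Sync>>,
}

/// What a downstream user might keep in the slot (hypothetical).
#[derive(Debug, PartialEq)]
pub struct TraceContext(pub String);

/// What a morsel-aware scan might stash in the same slot (hypothetical).
#[derive(Debug, PartialEq)]
pub struct StashedMorsels(pub Vec<(u64, u64)>);

/// Returns the trace context if the slot currently holds one.
pub fn trace_context(file: &FileWithExtensions) -> Option<&TraceContext> {
    file.extensions.as_ref()?.downcast_ref::<TraceContext>()
}
```

With a single slot, assigning `Some(Arc::new(StashedMorsels(..)))` to `extensions` makes `trace_context` return `None` afterwards, which is the kind of breakage described above.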
   
   Ideas going forward:
   1. Rather than add a parallel API to the `FileOpener` API, I think we should 
try and make Morsels explicit in the pipeline somehow (perhaps via a 
`Morselizer` trait, and rename `FileOpener` to a `MorselOpener` trait 🤔 )
   2. Generalize the existing code that tries to statically split files based 
on byte offsets -- perhaps create morsels there as well.
   3. Isolate the work stealing to a new structure / trait (as we will need to 
turn it off for some plans, e.g. those that require sortedness)
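
One way ideas 1 and 3 could fit together, as a rough sketch rather than a proposed API: every name below (`Morsel`, `Morselizer`, `FixedSizeMorselizer`, `MorselSource`, `SharedMorselQueue`, `PrivateMorselQueue`) is hypothetical. A mutex-protected shared queue stands in for real work stealing, and the opener side is left out entirely.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

/// A unit of scan work: a byte range of one file (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct Morsel {
    pub path: String,
    pub start: u64,
    pub end: u64,
}

/// Decides how a file is split into morsels (hypothetical trait).
pub trait Morselizer {
    fn morselize(&self, path: &str, file_size: u64) -> Vec<Morsel>;
}

/// Splits files into fixed-size byte ranges (assumed strategy).
pub struct FixedSizeMorselizer {
    pub target_bytes: u64,
}

impl Morselizer for FixedSizeMorselizer {
    fn morselize(&self, path: &str, file_size: u64) -> Vec<Morsel> {
        let step = self.target_bytes.max(1); // avoid a zero-length step
        let mut morsels = Vec::new();
        let mut start: u64 = 0;
        while start < file_size {
            let end = file_size.min(start.saturating_add(step));
            morsels.push(Morsel { path: path.to_string(), start, end });
            start = end;
        }
        morsels
    }
}

/// Where a partition gets its next morsel from. Keeping this behind a
/// trait lets a plan choose whether morsels are shared across partitions.
pub trait MorselSource {
    fn next_morsel(&mut self) -> Option<Morsel>;
}

/// Shared across partitions: whichever partition is idle takes the next
/// morsel. A simple stand-in for work stealing.
#[derive(Clone)]
pub struct SharedMorselQueue {
    inner: Arc<Mutex<VecDeque<Morsel>>>,
}

impl SharedMorselQueue {
    pub fn new(morsels: Vec<Morsel>) -> Self {
        Self { inner: Arc::new(Mutex::new(morsels.into())) }
    }
}

impl MorselSource for SharedMorselQueue {
    fn next_morsel(&mut self) -> Option<Morsel> {
        self.inner.lock().expect("queue mutex poisoned").pop_front()
    }
}

/// Owned by one partition: morsels come out in the order they were
/// planned, which is what a plan that requires sortedness would need.
pub struct PrivateMorselQueue {
    inner: VecDeque<Morsel>,
}

impl PrivateMorselQueue {
    pub fn new(morsels: Vec<Morsel>) -> Self {
        Self { inner: morsels.into() }
    }
}

impl MorselSource for PrivateMorselQueue {
    fn next_morsel(&mut self) -> Option<Morsel> {
        self.inner.pop_front()
    }
}
```

In this shape, turning work stealing off is a matter of handing each partition a `PrivateMorselQueue` instead of a clone of one `SharedMorselQueue`; the code that consumes morsels does not change.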
   
   I am going to try and prototype what a more explicit "Morselizer" / 
"FileOpener" might look like
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

