thinkharderdev commented on issue #30:
URL: https://github.com/apache/arrow-ballista/issues/30#issuecomment-1276062272

   > > I'm a little confused here. Avoiding fetching 100s of parquet files is more like an optimizer issue?
   > 
   > What I mean is that now, if you give a datafusion plan 5000 parquet files (maybe for a `SUM()` type query), it will likely try to start reading / decoding all 5000 files concurrently, even if the downstream operators can only consume a small fraction at once. This means the resources (file handles and/or memory buffers) for reading all 5000 are held open for the duration of the plan.
   > 
   > It also means if the plan terminates early (e.g. `... LIMIT 10`) a large amount of IO will be done / wasted.
   > 
   > What we would like to happen is that a smaller subset is read and fully processed before new ones are opened. Of course, one challenge with doing this is that we still need sufficient parallelism to hide IO latencies.
   
   Within a given partition, `FileStream` will still process the files sequentially, right?
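   
   To make the bounded-concurrency idea concrete, here is a rough sketch using the `futures` crate directly (not DataFusion's actual `FileStream` / scan code, and with a made-up `open_and_decode` helper): at most N reads are in flight at once, and a `LIMIT`-style consumer that drops the stream early means the remaining files are never opened:
   
   ```rust
   use futures::stream::{self, StreamExt};
   
   // Hypothetical stand-in for opening + decoding one parquet file; not a DataFusion API.
   async fn open_and_decode(path: String) -> usize {
       // A real scan would open the file and decode record batches here.
       path.len()
   }
   
   #[tokio::main]
   async fn main() {
       // Requires the `tokio` and `futures` crates.
       let paths: Vec<String> = (0..5000).map(|i| format!("part-{i}.parquet")).collect();
   
       // `buffered(8)` keeps at most 8 reads in flight: enough parallelism to hide IO
       // latency, but new files are only opened as earlier ones complete.
       let results: Vec<usize> = stream::iter(paths)
           .map(open_and_decode)
           .buffered(8)
           // A `LIMIT 10`-style consumer: after 10 results the stream is dropped, so
           // the remaining files are never opened at all.
           .take(10)
           .collect()
           .await;
   
       println!("{} batches read", results.len());
   }
   ```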

