alamb commented on issue #16841:
URL: https://github.com/apache/datafusion/issues/16841#issuecomment-3355803139

   > I started some work on this in https://github.com/apache/datafusion/pull/17758, but I really think the ideal scenario is a config along the lines of "buffer up to 1GB of data from DataSourceExec", where that budget could cover opening 1024 files that each produce 1kB of data, or 1 file that produces a 2GB RecordBatch on each poll.
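
   To make the idea concrete, here is a minimal sketch of such a byte-denominated budget. `DataSourceBufferBudget`, `try_reserve`, and `release` are hypothetical names for illustration only; no such option or type exists in DataFusion today.

   ```rust
   // Hypothetical sketch of the "buffer up to N bytes" idea; none of
   // these names exist in DataFusion. The point is that the budget is
   // expressed in bytes rather than files or batches, so both workload
   // shapes (many tiny files, or one file with huge batches) are covered.

   /// Hypothetical per-DataSourceExec buffering budget.
   struct DataSourceBufferBudget {
       /// Maximum bytes of decoded data buffered at once (e.g. 1 GB).
       limit_bytes: usize,
       /// Bytes currently held by in-flight batches.
       used_bytes: usize,
   }

   impl DataSourceBufferBudget {
       fn new(limit_bytes: usize) -> Self {
           Self { limit_bytes, used_bytes: 0 }
       }

       /// Try to reserve `bytes` for a batch about to be buffered;
       /// returns false if the poll should back off instead.
       fn try_reserve(&mut self, bytes: usize) -> bool {
           if self.used_bytes + bytes <= self.limit_bytes {
               self.used_bytes += bytes;
               true
           } else {
               false
           }
       }

       /// Release the reservation once a batch is consumed downstream.
       fn release(&mut self, bytes: usize) {
           self.used_bytes = self.used_bytes.saturating_sub(bytes);
       }
   }

   fn main() {
       let mut budget = DataSourceBufferBudget::new(1024 * 1024 * 1024); // 1 GB
       // A 1kB batch from a small file fits comfortably...
       assert!(budget.try_reserve(1024));
       // ...while a single 2GB RecordBatch would exceed the budget.
       assert!(!budget.try_reserve(2 * 1024 * 1024 * 1024));
   }
   ```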
   
   I have been thinking a lot about this too in relation to a push parquet 
decoder
   - https://github.com/apache/arrow-rs/issues/7983
   
   Currently, the arrow-rs parquet decoder decodes in units of a row group (essentially, it fetches all the data pages in a row group needed to read a column, after accounting for predicates). This keeps the number of IOs low, which is important when reading from object storage, but it also means the minimum memory usage is a function of the row group size, so it can't easily be capped at a fixed amount.
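
   For reference, today's arrow-rs reader exposes exactly this granularity: you can choose *which* row groups to decode, but a chosen row group is decoded wholesale. A minimal sketch with the synchronous reader (assuming a local `example.parquet`; the async `ParquetRecordBatchStreamBuilder` used for object storage behaves analogously):

   ```rust
   use std::fs::File;

   use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

   fn main() -> Result<(), Box<dyn std::error::Error>> {
       let file = File::open("example.parquet")?;
       let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

       // The metadata says how large each row group is; because the
       // decoder fetches all needed pages of a row group before emitting
       // batches, this is roughly the memory floor per row group.
       for (i, rg) in builder.metadata().row_groups().iter().enumerate() {
           println!(
               "row group {i}: {} rows, ~{} compressed bytes buffered at minimum",
               rg.num_rows(),
               rg.compressed_size()
           );
       }

       // Reading can be restricted to chosen row groups, but not to any
       // finer unit: row group 0 is decoded wholesale here.
       let reader = builder.with_row_groups(vec![0]).build()?;
       for batch in reader {
           println!("batch with {} rows", batch?.num_rows());
       }
       Ok(())
   }
   ```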
   
   I am hoping that once we have a push decoder working (I hope to have it ready for review later this week), it will finally be feasible to decode a subset of the row groups at a time, trading off the number of requests against memory usage.
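
   For a sense of why a push model enables that trade-off, here is a purely illustrative decode loop. None of these types are the API proposed in arrow-rs#7983; `PushDecoderStub` and `DecodeState` are stand-ins showing that the caller owns the IO, and so can choose between coalescing fetches (fewer requests, more memory) and splitting them (more requests, bounded memory):

   ```rust
   // Purely illustrative: in a push model the decoder asks for byte
   // ranges and the caller decides how to fetch them, instead of the
   // decoder pulling a whole row group's pages internally.

   use std::ops::Range;

   /// What the decoder asks for next (hypothetical).
   enum DecodeState {
       /// Decoder needs these file byte ranges to make progress.
       NeedsData(Vec<Range<u64>>),
       /// A batch of rows is ready.
       Ready(RecordBatchStub),
       /// End of file.
       Finished,
   }

   /// Stand-ins for arrow types, to keep the sketch self-contained.
   struct RecordBatchStub { num_rows: usize }
   struct PushDecoderStub { polls: usize }

   impl PushDecoderStub {
       fn poll(&mut self) -> DecodeState {
           // Toy state machine: ask for data once, emit one batch, finish.
           self.polls += 1;
           match self.polls {
               1 => DecodeState::NeedsData(vec![0..4096]),
               2 => DecodeState::Ready(RecordBatchStub { num_rows: 1024 }),
               _ => DecodeState::Finished,
           }
       }
       fn push(&mut self, _range: Range<u64>, _bytes: Vec<u8>) {}
   }

   fn main() {
       let mut decoder = PushDecoderStub { polls: 0 };
       loop {
           match decoder.poll() {
               DecodeState::NeedsData(ranges) => {
                   // The caller chooses how to satisfy these: one big
                   // coalesced request, or several small bounded ones.
                   for r in ranges {
                       let bytes = vec![0u8; (r.end - r.start) as usize];
                       decoder.push(r, bytes);
                   }
               }
               DecodeState::Ready(batch) => println!("decoded {} rows", batch.num_rows),
               DecodeState::Finished => break,
           }
       }
   }
   ```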

