[GitHub] [arrow-datafusion] alamb commented on issue #464: Question: Can DataFusion handle larger than RAM datasets?

GitBox Fri, 04 Jun 2021 14:09:34 -0700


alamb commented on issue #464:
URL: https://github.com/apache/arrow-datafusion/issues/464#issuecomment-854998756



   @voltcode -- DataFusion is, at its core, an in-memory processing system.
   
   That being said, depending on what the plan is doing, simply reading from a large number of Parquet files does not necessarily mean they will all be decompressed into memory at once.
   
   DataFusion has several features that keep memory usage down:
   1. It reads only the columns required for the query ("projection pushdown").
   2. It attempts to prune row groups based on their metadata and skips them entirely where possible.
   3. It has a "streaming" model of computation, so it reads the Parquet files into memory in small batches.
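
To make the three points concrete, here is a toy sketch in plain Python. It is not DataFusion's implementation or API: `ROW_GROUPS`, `scan`, and the min/max statistics layout are all hypothetical stand-ins for a Parquet file and its reader, chosen only to show how projection, row-group pruning, and batch-at-a-time reading each limit what is held in memory.

```python
from typing import Dict, Iterator, List

# Hypothetical stand-in for a Parquet file: a list of row groups, each with
# min/max statistics for column "x" and the data for three columns.
ROW_GROUPS = [
    {"stats": {"x": (0, 9)},
     "columns": {"x": list(range(0, 10)), "y": ["a"] * 10, "z": [0.0] * 10}},
    {"stats": {"x": (10, 19)},
     "columns": {"x": list(range(10, 20)), "y": ["b"] * 10, "z": [1.0] * 10}},
    {"stats": {"x": (20, 29)},
     "columns": {"x": list(range(20, 30)), "y": ["c"] * 10, "z": [2.0] * 10}},
]


def scan(row_groups: List[dict], projection: List[str], filter_col: str,
         lower_bound: int, batch_size: int) -> Iterator[Dict[str, list]]:
    """Yield small batches that may contain rows with filter_col >= lower_bound."""
    for group in row_groups:
        _, max_value = group["stats"][filter_col]
        if max_value < lower_bound:
            # Pruning: the metadata proves no row can match, so skip the group.
            continue
        n_rows = len(group["columns"][filter_col])
        for start in range(0, n_rows, batch_size):
            end = start + batch_size
            # Projection: copy only the requested columns into the batch.
            yield {name: group["columns"][name][start:end] for name in projection}


# Roughly "SELECT sum(x) WHERE x >= 15": only one small batch is alive at a time.
total = 0
for batch in scan(ROW_GROUPS, ["x"], "x", 15, batch_size=4):
    total += sum(v for v in batch["x"] if v >= 15)
```

Note that pruning works at row-group granularity, so the surviving batches can still contain non-matching rows; the consumer applies the row-level filter itself.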
   
   Certain operations in DataFusion are likely to consume large amounts of memory, notably "Sort" and "Join", as well as grouping when there are large numbers of distinct groups.
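
The difference between streaming-friendly and memory-hungry operators can be sketched the same way. Again this is a toy model in plain Python with hypothetical function names, not DataFusion code: a running sum keeps a single value, grouping keeps one entry per distinct key, and a sort has to buffer every row before it can emit the first one.

```python
from typing import Dict, Iterable, List


def streaming_sum(batches: Iterable[List[int]]) -> int:
    """Constant state: one running total, regardless of input size."""
    total = 0
    for batch in batches:
        total += sum(batch)
    return total


def group_counts(batches: Iterable[List[int]]) -> Dict[int, int]:
    """State grows with the number of distinct keys seen so far."""
    counts: Dict[int, int] = {}
    for batch in batches:
        for key in batch:
            counts[key] = counts.get(key, 0) + 1
    return counts


def sort_all(batches: Iterable[List[int]]) -> List[int]:
    """State grows with the total number of rows: all input is buffered."""
    buffered: List[int] = []
    for batch in batches:
        buffered.extend(batch)
    buffered.sort()
    return buffered


batches = [[3, 1, 2], [2, 5, 4], [1, 1, 6]]
running_total = streaming_sum(batches)
groups = group_counts(batches)
ordered = sort_all(batches)
```

In this model a join would behave like `group_counts` or `sort_all`, since at least one input has to be held (for example in a hash table) while the other is streamed past it.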

