alamb commented on issue #464: URL: https://github.com/apache/arrow-datafusion/issues/464#issuecomment-854998756
@voltcode -- DataFusion is at its core an in memory processing system. That being said, depending on what the plan is doing, simply reading from a large number of parquet files does not necessarily mean they will be decompressed all at once into memory. DataFusion has several features that keep the memory usage down: 1. It will only read columns required for the query "projection pushdown" 2. It will attempt to prune row groups (based on metadata) and skip them entirely if possible 3. It has a "streaming" model of computation and so will read the parquet files into memory in small batches. Certain operations in DataFusion are likely to consume large amounts of memory, notable "Sort" and "Join" (as well as grouping where there are large numbers of distinct groups) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
