GitHub user alexwilcoxson-rel closed a discussion: Memory usage during Parquet scan
Hi DataFusion community! I'm relatively new to Arrow, DataFusion, and Rust, and I have a question about how Parquet records are streamed. I run an Arrow Flight service in front of DataFusion. When scanning a 50GB table made up of physical Parquet files ranging from ~256MB to ~1GB, I see memory usage of around 1-2GB. I get the sense that as the Arrow batches are yielded to the outbound Flight stream, the memory they use is not released until the source file has been fully scanned. Is this accurate?

I'm asking because, as we look to support many concurrent requests, each request holding onto a whole file requires a lot of memory. When we reduced the Parquet file size to ~32MB each, we could support many more requests, but I was wondering whether there is a DataFusion configuration change (or a feature request we could file) that would keep memory usage low when streaming larger files. We are using version 19. Thanks!

GitHub link: https://github.com/apache/datafusion/discussions/5901
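For context, a minimal sketch of the kind of streaming setup and configuration knobs the question is about, assuming the `datafusion` and `futures` crates; the `/data/table` path is hypothetical. `with_batch_size` bounds how many rows each yielded `RecordBatch` holds, and `execute_stream` yields batches incrementally rather than collecting the whole result:

```rust
use datafusion::prelude::*;
use datafusion::execution::context::SessionConfig;
use futures::StreamExt;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // Smaller batches mean less memory held per yielded RecordBatch
    // (8192 rows is DataFusion's default batch size).
    let config = SessionConfig::new().with_batch_size(8192);
    let ctx = SessionContext::with_config(config);

    // "/data/table" is a hypothetical path for illustration.
    let df = ctx
        .read_parquet("/data/table", ParquetReadOptions::default())
        .await?;

    // execute_stream yields RecordBatches one at a time; each batch can be
    // forwarded to the outbound Arrow Flight stream and dropped once sent.
    let mut stream = df.execute_stream().await?;
    while let Some(batch) = stream.next().await {
        let batch = batch?;
        // forward `batch` to the Flight stream here, then let it drop
        drop(batch);
    }
    Ok(())
}
```

Note this only controls the size of each yielded batch, not how much of a Parquet file the scan buffers internally, which is what the question is asking about.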
