GitHub user ndchandar edited a discussion: Feedback on high memory usage when merging N parquet files

Hello,
I am writing a program that takes N parquet files (where N = 40). Each source 
parquet file is about ~6 to ~8 MB in size and is Zstd compressed. They are 
compacted/combined into a single larger parquet file (~220 to ~250 MB). It 
appears we need as much as **~24 GB** of memory for the compaction to succeed. 
This gist 
https://gist.github.com/ndchandar/3900558ff719cefeb8b058e36a18f8be#file-parquet_rewriter-rs-L32-L138
  (`export_with_datafusion`) is the interesting bit. It basically lists all 
files in a directory, takes N of them, and compacts them. I tried giving the 
optimizer hints that the sources are already sorted, but it doesn't seem to 
help. The row group size is set to 1M rows.
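
For context, here is a simplified sketch of the kind of setup the gist uses (not the exact code; the paths, compression level, and omission of the sort-order hints are illustrative):

```rust
use datafusion::config::TableParquetOptions;
use datafusion::dataframe::DataFrameWriteOptions;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Read the small Zstd-compressed parquet files from the source directory.
    // (The real code also tells DataFusion the inputs are already sorted; omitted here.)
    let df = ctx
        .read_parquet("input_dir/", ParquetReadOptions::default())
        .await?;

    // Write one combined parquet file with 1M-row row groups.
    let mut parquet_opts = TableParquetOptions::default();
    parquet_opts.global.max_row_group_size = 1_000_000;
    parquet_opts.global.compression = Some("zstd(3)".to_string());

    df.write_parquet(
        "output/combined.parquet",
        DataFrameWriteOptions::new(),
        Some(parquet_opts),
    )
    .await?;

    Ok(())
}
```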

With less memory (e.g. 12 or 16 GB), I run into the issue below:
```
Caused by:
    Resources exhausted: Failed to allocate additional 2.0 MB for ExternalSorterMerge[4] with 49.8 MB already allocated for this reservation - 1826.2 KB remain available for the total pool
```
I am trying to understand why spilling is not happening efficiently (I am 
relatively new to DataFusion). Any help or hints to reduce the memory 
utilization would be appreciated.
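
For reference, this is roughly how a bounded memory pool can be set up so that sorts spill to disk (a sketch, not the exact gist code; the sizes below are illustrative):

```rust
use std::sync::Arc;

use datafusion::execution::memory_pool::FairSpillPool;
use datafusion::execution::runtime_env::RuntimeEnvBuilder;
use datafusion::prelude::*;

fn make_context(memory_limit_bytes: usize) -> datafusion::error::Result<SessionContext> {
    // Bound DataFusion's memory with a spill-friendly pool so that the
    // external sort spills to disk instead of erroring once the pool fills up.
    let runtime = RuntimeEnvBuilder::new()
        .with_memory_pool(Arc::new(FairSpillPool::new(memory_limit_bytes)))
        .build_arc()?;

    // Reserve some memory for the sort's merge phase (value is illustrative,
    // not what the gist uses).
    let mut config = SessionConfig::new();
    config.options_mut().execution.sort_spill_reservation_bytes = 64 * 1024 * 1024;

    Ok(SessionContext::new_with_config_rt(config, runtime))
}
```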

GitHub link: https://github.com/apache/datafusion/discussions/18833
