REASY opened a new issue, #4973: URL: https://github.com/apache/arrow-rs/issues/4973
**Describe the bug**

When writing Parquet with arrow-rs for a schema with large array columns, memory consumption is about 10x larger when page-level statistics are enabled.

The schema of the Parquet file [has the following fields](https://github.com/REASY/parquet-example-rs/blob/main/src/schema.rs#L5C12-L5C36):

- timestamp, UInt64
- num_points, UInt32
- x, Array[Float32], size is 250 000
- y, Array[Float32], size is 250 000
- z, Array[Float32], size is 250 000
- intensity, Array[UInt8], size is 250 000
- ring, Array[UInt8], size is 250 000

**To Reproduce**

- Fork https://github.com/REASY/parquet-example-rs
- Run `cargo build --release && /usr/bin/time -pv target/release/parquet-example-rs --output-parquet-folder output --rows 8000 --statistics-mode page`
- Check `Maximum resident set size (kbytes)` in the output of /usr/bin/time

Curiously, when I run the DHAT memory profiler, I do not see much difference in memory consumption: https://github.com/REASY/parquet-example-rs#memory-profiler

When I trace the code, the only place where `EnabledStatistics::Page` is used is https://github.com/apache/arrow-rs/blob/1d6feeacebb8d0d659d493b783ba381940973745/parquet/src/column/writer/encoder.rs#L139-L144, and it is not clear how it can cause so much allocation.
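For scale, here is a rough back-of-the-envelope sketch of the numbers involved. It assumes every list column holds exactly 250 000 elements and ignores offsets, validity buffers, encoding and compression, so it only estimates the raw payload and is not a measurement. The helper names `row_payload_bytes` and `rss_ratio` are made up for this illustration and are not part of arrow-rs or the example repository. The resident set sizes are taken from the table under **Additional context**.

```rust
// Assumption: each list column holds exactly this many elements per row.
const POINTS_PER_ROW: u64 = 250_000;

/// Estimated raw (unencoded, uncompressed) payload of one row, in bytes.
fn row_payload_bytes() -> u64 {
    let scalars = 8 + 4; // timestamp (UInt64) + num_points (UInt32)
    let float_lists = 3 * POINTS_PER_ROW * 4; // x, y, z as Float32
    let byte_lists = 2 * POINTS_PER_ROW; // intensity, ring as UInt8
    scalars + float_lists + byte_lists
}

/// Ratio between two maximum resident set sizes (same unit for both).
fn rss_ratio(page_mb: f64, baseline_mb: f64) -> f64 {
    page_mb / baseline_mb
}

fn main() {
    let per_row = row_payload_bytes();
    let rows: u64 = 8000;
    println!("raw payload per row: {} bytes", per_row);
    println!("raw payload for {} rows: {} bytes", rows, per_row * rows);

    // Maximum resident set sizes in Mbytes, copied from the comparison table.
    let (none, chunk, page) = (752.67, 790.96, 8516.36);
    println!("Page vs None:  {:.2}x", rss_ratio(page, none));
    println!("Page vs Chunk: {:.2}x", rss_ratio(page, chunk));
}
```

Under these assumptions a single row carries about 3.5 MB of raw data, so the None and Chunk runs peak at a few hundred rows' worth of payload, while the Page run peaks at a few thousand.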
**Expected behavior**

**Additional context**

Dependencies:

```toml
arrow = "47"
clap = { version = "4", features = ["derive"] }
dhat = "0.3.2"
once_cell = "1.18.0"
parquet = "47"
rand = "0.8"
```

The three statistics modes are compared at https://github.com/REASY/parquet-example-rs#page-statistics-consume-10x-more-memory-when-write-8000-rows. For the same number of rows, see the table below:

| Statistics mode | Number of rows | Total time, seconds | CPU usage, % | Average throughput, rows/s | Maximum resident set size, Mbytes |
|-----------------|----------------|---------------------|--------------|----------------------------|-----------------------------------|
| None            | 8000           | 113.124             | 96           | 70.719                     | 752.67                            |
| Chunk           | 8000           | 128.318             | 97           | 62.345                     | 790.96                            |
| Page            | 8000           | 130.53              | 98           | 61.301                     | 8516.36                           |

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
