REASY opened a new issue, #4973:
URL: https://github.com/apache/arrow-rs/issues/4973

   **Describe the bug**
   When writing Parquet with arrow-rs and large array columns, peak memory
consumption is about 10x higher when page statistics are enabled. The schema of
the Parquet file [has the following
fields](https://github.com/REASY/parquet-example-rs/blob/main/src/schema.rs#L5C12-L5C36):
   - timestamp, UInt64
   - num_points, UInt32
   - x, Array[Float32], size is 250 000 
   - y, Array[Float32], size is 250 000 
   - z, Array[Float32], size is 250 000 
   - intensity, Array[UInt8], size is 250 000 
   - ring, Array[UInt8], size is 250 000 
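
   For scale, the raw size of one row can be estimated from the field list above. This is only a back-of-the-envelope sketch: it assumes every array holds exactly 250 000 elements and ignores list offsets, definition/repetition levels, encoding, and compression. The function and constant names are illustrative and are not taken from the example repository.

   ```rust
   /// Number of elements in each array column (assumed constant for every row).
   const POINTS_PER_ROW: u64 = 250_000;

   /// Uncompressed value bytes in one row of the schema listed above.
   fn row_payload_bytes(points: u64) -> u64 {
       let scalars = 8 + 4; // timestamp: UInt64, num_points: UInt32
       let floats = 3 * points * 4; // x, y, z: Float32
       let bytes = 2 * points; // intensity, ring: UInt8
       scalars + floats + bytes
   }

   fn main() {
       let rows: u64 = 8_000;
       let per_row = row_payload_bytes(POINTS_PER_ROW);
       let total_gb = (per_row * rows) as f64 / 1e9;
       println!("per row: {} bytes", per_row);
       println!("total for {} rows: {:.2} GB", rows, total_gb);
   }
   ```

   Under these assumptions a row carries about 3.5 MB of values, so 8000 rows amount to roughly 28 GB of raw data.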
   
   
   **To Reproduce**
   - Fork https://github.com/REASY/parquet-example-rs
   - Run `cargo build --release && /usr/bin/time -pv target/release/parquet-example-rs --output-parquet-folder output --rows 8000 --statistics-mode page`
   - Check `Maximum resident set size (kbytes)` in the output of `/usr/bin/time`
   
   Curiously, when I run the DHAT memory profiler, I do not see much difference
in memory consumption: https://github.com/REASY/parquet-example-rs#memory-profiler
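
   As an in-process cross-check of the number reported by `/usr/bin/time`, the peak resident set size can also be read from the `VmHWM` field of `/proc/self/status`. The sketch below is Linux-only and assumes the usual `VmHWM:   <n> kB` layout; the helper name `parse_vm_hwm_kb` is hypothetical and is not part of the example repository.

   ```rust
   use std::fs;

   /// Extract the `VmHWM` (peak resident set size) value in kB from the text
   /// of `/proc/<pid>/status`. Returns `None` if the field is missing or malformed.
   fn parse_vm_hwm_kb(status: &str) -> Option<u64> {
       status
           .lines()
           .find_map(|line| line.strip_prefix("VmHWM:"))
           .and_then(|rest| rest.split_whitespace().next())
           .and_then(|kb| kb.parse().ok())
   }

   fn main() {
       // Linux-only: other platforms do not expose /proc/self/status.
       match fs::read_to_string("/proc/self/status") {
           Ok(status) => match parse_vm_hwm_kb(&status) {
               Some(kb) => println!("peak RSS: {} kB", kb),
               None => println!("VmHWM field not found"),
           },
           Err(e) => println!("could not read /proc/self/status: {}", e),
       }
   }
   ```

   Calling such a helper at the end of the writer run gives a figure comparable to the one from `/usr/bin/time`, whereas a heap profiler counts only the allocations it intercepts.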
   
   When I trace the code, the only place where `EnabledStatistics::Page` is
used is
https://github.com/apache/arrow-rs/blob/1d6feeacebb8d0d659d493b783ba381940973745/parquet/src/column/writer/encoder.rs#L139-L144
 and it is not clear how that could cause so much allocation.
   
   **Expected behavior**
   
   
   **Additional context**
   Dependencies:
   ```cargo
   arrow = "47"
   clap = { version = "4", features = ["derive"] }
   dhat = "0.3.2"
   once_cell = "1.18.0"
   parquet = "47"
   rand = "0.8"
   ```
   
   A comparison of the three statistics modes is available at
https://github.com/REASY/parquet-example-rs#page-statistics-consume-10x-more-memory-when-write-8000-rows.
 For the same number of rows, see the table below:
   
   | Statistics mode | Number of rows | Total time, s | CPU usage, % | Average throughput, rows/s | Maximum resident set size, MB |
   |-----------------|----------------|---------------|--------------|----------------------------|-------------------------------|
   | None            | 8000           | 113.124       | 96           | 70.719                     | 752.67                        |
   | Chunk           | 8000           | 128.318       | 97           | 62.345                     | 790.96                        |
   | Page            | 8000           | 130.53        | 98           | 61.301                     | 8516.36                       |
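
   To make the gap explicit, the peak memory of each mode can be expressed relative to the `None` baseline. This small sketch only redoes the arithmetic on the figures from the table; the names are illustrative.

   ```rust
   /// Peak resident set size in MB per statistics mode, copied from the table above.
   const PEAK_RSS_MB: [(&str, f64); 3] = [
       ("None", 752.67),
       ("Chunk", 790.96),
       ("Page", 8516.36),
   ];

   /// How many times larger `value` is than `baseline`.
   fn ratio_to_baseline(value: f64, baseline: f64) -> f64 {
       value / baseline
   }

   fn main() {
       let baseline = PEAK_RSS_MB[0].1;
       for (mode, mb) in PEAK_RSS_MB.iter() {
           let ratio = ratio_to_baseline(*mb, baseline);
           println!("{:<5} {:>8.2} MB  {:.2}x", mode, mb, ratio);
       }
   }
   ```

   With these numbers, `Chunk` is about 1.05x the baseline and `Page` is about 11.3x, while total run time differs by only about 15%.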
   
   
   

