kszlim opened a new issue, #6839:
URL: https://github.com/apache/arrow-rs/issues/6839

   **Describe the bug**
   The `ArrowWriter` doesn't track memory size correctly for `FixedSizeList` columns: it appears to treat them as having a fixed memory footprint, i.e. the reported memory usage doesn't grow even though the underlying buffers keep growing in memory.
   
   **To Reproduce**
   ```toml
   [package]
   name = "repro"
   version = "0.1.0"
   edition = "2021"
   
   [dependencies]
   arrow = "53.3.0"
   parquet = "53.3.0"
   rand = "0.8.5"
   ```
   
   ```rust
   use arrow::array::{FixedSizeListBuilder, UInt8Builder};
   use arrow::datatypes::{DataType, Field, Schema};
   use arrow::record_batch::RecordBatch;
   use parquet::arrow::ArrowWriter;
   use parquet::file::properties::WriterProperties;
   use rand::Rng;
   use std::fs::File;
   use std::sync::Arc;
   
   fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Define the field and schema for a single column that is a
        // fixed-size list of u8 values.
        let list_length = 1_048_576;
        let field = Field::new(
            "mylist",
            DataType::FixedSizeList(
                Arc::new(Field::new("item", DataType::UInt8, true)),
                list_length,
            ),
            true,
        );
        let schema = Arc::new(Schema::new(vec![field]));
    
        // Create a writer for the Parquet file.
        let file = File::create("output_randomized.parquet")?;
        let props = WriterProperties::builder().build();
        let mut writer = ArrowWriter::try_new(file, schema.clone(), Some(props))?;
    
        let iterations = 10000;
        let values_per_batch = list_length;
    
        let mut list_arr_builder = FixedSizeListBuilder::new(UInt8Builder::new(), list_length);
        for _ in 0..iterations {
            // Generate random data for the values array.
            let mut rng = rand::thread_rng();
            let values: Vec<u8> = (0..values_per_batch).map(|_| rng.gen()).collect();
    
            // Build a single-row batch containing one fixed-size list.
            list_arr_builder.values().append_slice(&values);
            list_arr_builder.append(true);
            let output = list_arr_builder.finish();
            let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(output)])?;
    
            // Compare the writer's reported memory usage before and after the write.
            let in_memory_size = writer.memory_size() + writer.in_progress_size();
            let before_in_memory_size_mb = (in_memory_size as f64) / (1024f64.powi(2));
            writer.write(&batch)?;
            let in_memory_size = writer.memory_size() + writer.in_progress_size();
            let after_in_memory_size_mb = (in_memory_size as f64) / (1024f64.powi(2));
            let change_in_usage = after_in_memory_size_mb - before_in_memory_size_mb;
            dbg!(change_in_usage, after_in_memory_size_mb, before_in_memory_size_mb);
        }
    
        writer.close()?;
        println!("Wrote 10000 record batches with randomized data to output_randomized.parquet");
    
        Ok(())
    }
   ```
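   
   For reference, the Arrow-side memory of each batch can be measured independently of the writer. A minimal sketch along these lines (the `compare_sizes` helper is purely illustrative, not part of the repro above) is one way to put the batch's real footprint, roughly 1 MiB per batch here, next to the writer's reported numbers:
   
    ```rust
    use arrow::record_batch::RecordBatch;
    use parquet::arrow::ArrowWriter;
    use std::fs::File;
    
    // Illustrative helper (not part of the repro above): print the writer's
    // reported buffered size next to the actual Arrow memory of the batch
    // that was just handed to it.
    fn compare_sizes(writer: &ArrowWriter<File>, batch: &RecordBatch) {
        // What the writer claims to be buffering.
        let reported = writer.memory_size() + writer.in_progress_size();
        // What the batch's Arrow buffers actually occupy.
        let batch_bytes = batch.get_array_memory_size();
        println!("writer reports {reported} bytes, batch alone holds {batch_bytes} bytes");
    }
    ```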
   **Expected behavior**
   The reported memory usage should rise as batches are buffered, drop back to around zero each time a flush is triggered, and then repeat that pattern.
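   
   If the accounting handled `FixedSizeList` columns correctly, a check along these lines (a hypothetical snippet, not part of the repro) would show that sawtooth directly:
   
    ```rust
    use arrow::record_batch::RecordBatch;
    use parquet::arrow::ArrowWriter;
    use std::fs::File;
    
    // Hypothetical check of the expected sawtooth: the buffered size should
    // grow after a write and drop back once the in-progress row group is
    // flushed.
    fn check_sawtooth(
        writer: &mut ArrowWriter<File>,
        batch: &RecordBatch,
    ) -> Result<(), Box<dyn std::error::Error>> {
        let before = writer.in_progress_size();
        writer.write(batch)?;
        let after_write = writer.in_progress_size();
        writer.flush()?; // force the buffered row group out to the file
        let after_flush = writer.in_progress_size();
        // With correct tracking we'd expect after_write > before and after_flush ≈ 0.
        println!("before={before} after_write={after_write} after_flush={after_flush}");
        Ok(())
    }
    ```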
   

