kszlim opened a new issue, #6839:
URL: https://github.com/apache/arrow-rs/issues/6839
**Describe the bug**

The Arrow writer doesn't track memory size correctly; it appears to treat `FixedSizeList` columns as having a fixed memory footprint, i.e. the reported memory usage doesn't grow even though the underlying buffers are actually growing in memory.
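For what it's worth, each batch really does carry on the order of a megabyte of data. Here is a minimal sketch (separate from the repro below, using `Array::get_array_memory_size`) that prints the in-memory footprint of a single `FixedSizeList` batch, which is roughly the amount the writer's reported size should grow by on each write:

```rust
use arrow::array::{Array, FixedSizeListBuilder, UInt8Builder};

fn main() {
    let list_length = 1_048_576;
    let mut builder = FixedSizeListBuilder::new(UInt8Builder::new(), list_length);

    // One list of `list_length` bytes, i.e. ~1 MiB of values per batch.
    builder.values().append_slice(&vec![0u8; list_length as usize]);
    builder.append(true);
    let array = builder.finish();

    // Prints the allocated size of the array's buffers (values + validity);
    // data of roughly this volume is handed to the writer on each `write` call.
    println!("array memory size: {} bytes", array.get_array_memory_size());
}
```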

**To Reproduce**

```toml
[package]
name = "repro"
version = "0.1.0"
edition = "2021"
[dependencies]
arrow = "53.3.0"
parquet = "53.3.0"
rand = "0.8.5"
```
```rust
use arrow::array::{FixedSizeListBuilder, UInt8Builder};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;
use rand::Rng;
use std::fs::File;
use std::sync::Arc;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Define the field and schema for a single column that is a fixed-size list of bytes.
    let list_length = 1_048_576;
    let field = Field::new(
        "mylist",
        DataType::FixedSizeList(
            Arc::new(Field::new("item", DataType::UInt8, true)),
            list_length,
        ),
        true,
    );
    let schema = Arc::new(Schema::new(vec![field]));

    // Create a writer for the Parquet file
    let file = File::create("output_randomized.parquet")?;
    let props = WriterProperties::builder().build();
    let mut writer = ArrowWriter::try_new(file, schema.clone(), Some(props))?;

    let iterations = 10000;
    let values_per_batch = list_length;
    let mut list_arr_builder = FixedSizeListBuilder::new(UInt8Builder::new(), list_length);

    for _ in 0..iterations {
        // Generate random data for the values array
        let mut rng = rand::thread_rng();
        let values: Vec<u8> = (0..values_per_batch).map(|_| rng.gen()).collect();
        list_arr_builder.values().append_slice(&values);
        list_arr_builder.append(true);
        let output = list_arr_builder.finish();
        let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(output)])?;

        let in_memory_size = writer.memory_size() + writer.in_progress_size();
        let before_in_memory_size_mb = (in_memory_size as f64) / (1024f64.powi(2));

        writer.write(&batch)?;

        let in_memory_size = writer.memory_size() + writer.in_progress_size();
        let after_in_memory_size_mb = (in_memory_size as f64) / (1024f64.powi(2));
        let change_in_usage = before_in_memory_size_mb - after_in_memory_size_mb;
        dbg!(change_in_usage, after_in_memory_size_mb, before_in_memory_size_mb);
    }

    writer.close()?;
    println!("Wrote 10000 record batches with randomized data to output_randomized.parquet");
    Ok(())
}
```

**Expected behavior**

The reported memory usage should rise as batches are written, drop back to roughly zero when a flush is triggered, and then repeat.
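For context, this is the kind of memory-bounded flushing the size reporting is meant to support. Below is a minimal sketch (the helper name and the 100 MiB threshold are my own, not part of the repro); with the behavior described above it never triggers a flush for `FixedSizeList` columns, because the reported in-progress size stays flat:

```rust
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::errors::Result;
use std::fs::File;

/// Write a batch, then flush the in-progress row group once the writer
/// reports more than `LIMIT` bytes buffered.
fn write_bounded(writer: &mut ArrowWriter<File>, batch: &RecordBatch) -> Result<()> {
    const LIMIT: usize = 100 * 1024 * 1024; // arbitrary 100 MiB budget

    writer.write(batch)?;
    if writer.in_progress_size() > LIMIT {
        writer.flush()?;
    }
    Ok(())
}
```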