dbr opened a new issue #623: URL: https://github.com/apache/arrow-rs/issues/623
**Describe the bug**

Using `arrow::csv::ReaderBuilder` with something like the `worldcitiespop_mil.csv` file mentioned on [this page](https://github.com/BurntSushi/xsv/blob/master/BENCHMARKS.md), I was experimenting with the batch-size setting in a standalone script, and it affected the RAM usage in a surprising way:

```rust
fn main() {
    let args: Vec<String> = std::env::args().collect();
    let fname = &args[1];
    dbg!(&fname);
    let batch_size: usize = args[2].parse().unwrap();

    let f = std::fs::File::open(&fname).unwrap();
    let reader = arrow::csv::ReaderBuilder::new()
        .infer_schema(Some(1_000))
        .has_header(true)
        .with_batch_size(batch_size)
        .build(f)
        .unwrap();

    let mut total = 0;
    for r in reader {
        total += r.unwrap().num_rows();
    }
    dbg!(total);

    // Uncomment to keep the process alive while inspecting RAM usage in `top`:
    // let mut input = String::new();
    // std::io::stdin().read_line(&mut input).unwrap();
}
```

If I run it like so:

```
cargo +1.53 run --release -- ./worldcitiespop_mil.csv 10
```

...then according to `top | grep arrcsv` the RAM usage is something like 5 MB. If I increase `10` to `100,000`, the RAM usage goes to maybe 30 MB. Add another zero and the RAM usage is 255 MB.

Not being too familiar with Arrow, I would have expected:

1. A larger batch size might take more RAM while parsing, but give more efficient storage.
2. A small batch size would reduce RAM usage while parsing, but have more overhead (if it were 10% more I wouldn't be surprised).

However, the opposite seems to be true, and the usage seems oddly high and, mainly, unpredictable.

While making this minimal example, I had a thought that maybe the `arrow::csv::Reader` was still being kept around and it was using the memory, not the `Vec<RecordBatch>`, so I refactored the reading into a function that returns only the batches, after which the reader should have been dropped...

```rust
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

fn hmm() -> Vec<Result<RecordBatch, ArrowError>> {
    let args: Vec<String> = std::env::args().collect();
    let fname = &args[1];
    let batch_size: usize = args[2].parse().unwrap();

    let f = std::fs::File::open(&fname).unwrap();
    let reader = arrow::csv::ReaderBuilder::new()
        .infer_schema(Some(1_000))
        .has_header(true)
        .with_batch_size(batch_size)
        .build(f)
        .unwrap();

    // The reader is consumed and dropped here; only the batches are returned.
    reader.collect()
}

fn main() {
    let batches = hmm();

    let mut total = 0;
    for r in batches {
        total += r.unwrap().num_rows();
    }
    dbg!(total);

    // Uncomment to keep the process alive while inspecting RAM usage in `top`:
    // let mut input = String::new();
    // std::io::stdin().read_line(&mut input).unwrap();
}
```

...but even more surprisingly, the memory usage drastically increased. With this change:

- With a batch size of 1,000, the RAM usage is now about 80 MB (much higher than the ~5 MB before).
- With a batch size of 1,000,000, the RAM usage is slightly higher (255 MB -> 300 MB).
- With a very small batch size of 10, the RAM usage is about 630 MB?!

**To Reproduce**

1. Create an empty project with `main.rs` as one of my terrible lumps of code above. The only dependency is `arrow = "5.0.0"`.
2. Run the example with `cargo +1.53 run --release -- ./worldcitiespop_mil.csv 1000` etc.
3. Monitor RAM usage somehow. I was using the output from `top | grep ...`, hence the stdin-reading line in the code; there is also a sketch at the end for reading the RSS programmatically.

**Expected behavior**

Mostly covered above, but basically I'd expect the memory usage with all of these combinations to be quite similar.

**Additional context**

I've not used Arrow much, so it's very much possible I'm doing something strange or incorrect!

Versions of stuff:

- Linux (Debian Buster)
- arrow 4.2 with Rust 1.51
- Also: arrow 5.0 with Rust 1.53
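For what it's worth, one way to separate what Arrow itself has allocated from allocator overhead is to sum what each column reports and compare that figure against the RSS shown in `top`. Below is a minimal sketch, assuming `Array::get_array_memory_size()` is available in the arrow version in use:

```rust
use arrow::array::Array;

fn main() {
    let args: Vec<String> = std::env::args().collect();
    let f = std::fs::File::open(&args[1]).unwrap();
    let batch_size: usize = args[2].parse().unwrap();

    let reader = arrow::csv::ReaderBuilder::new()
        .infer_schema(Some(1_000))
        .has_header(true)
        .with_batch_size(batch_size)
        .build(f)
        .unwrap();

    let mut rows = 0;
    let mut arrow_bytes = 0;
    for batch in reader {
        let batch = batch.unwrap();
        rows += batch.num_rows();
        // Sum the bytes each column reports as allocated (buffers plus metadata).
        arrow_bytes += batch
            .columns()
            .iter()
            .map(|col| col.get_array_memory_size())
            .sum::<usize>();
    }
    // Compare `arrow_bytes` against the RSS reported by `top`.
    dbg!(rows, arrow_bytes);
}
```

If the reported byte count stays roughly constant across batch sizes while the RSS does not, the difference presumably comes from buffer over-allocation or allocator behaviour rather than from the data itself.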
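And since monitoring `top` by eye is fiddly, here is a small Linux-only sketch for reading the process's resident set size directly; `rss_kb` is a hypothetical helper written for this repro, not part of arrow:

```rust
use std::fs;

/// Read the current process's resident set size in kB from
/// /proc/self/status (Linux-only).
fn rss_kb() -> Option<u64> {
    let status = fs::read_to_string("/proc/self/status").ok()?;
    status
        .lines()
        .find(|line| line.starts_with("VmRSS:"))
        .and_then(|line| line.split_whitespace().nth(1))
        .and_then(|kb| kb.parse().ok())
}

fn main() {
    println!("RSS at start: {:?} kB", rss_kb());
    // ... run one of the repro snippets above ...
    println!("RSS at end:   {:?} kB", rss_kb());
}
```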
