dbr opened a new issue #623: URL: https://github.com/apache/arrow-rs/issues/623
**Describe the bug**

Using `arrow::csv::ReaderBuilder` with something like the `worldcitiespop_mil.csv` file mentioned on [this page](https://github.com/BurntSushi/xsv/blob/master/BENCHMARKS.md), I was experimenting with the batch-size setting in a standalone script, and it affected the RAM usage in a surprising way:

```rust
fn main() {
    let args: Vec<String> = std::env::args().collect();
    let fname = &args[1];
    dbg!(&fname);
    let batch_size: usize = args[2].parse().unwrap();

    let f = std::fs::File::open(&fname).unwrap();
    let reader = arrow::csv::ReaderBuilder::new()
        .infer_schema(Some(1_000))
        .has_header(true)
        .with_batch_size(batch_size)
        .build(f)
        .unwrap();

    let mut total = 0;
    for r in reader {
        total += r.unwrap().num_rows();
    }
    dbg!(total);

    // Uncomment to keep the process alive while inspecting RAM usage in `top`:
    // let mut input = String::new();
    // std::io::stdin().read_line(&mut input).unwrap();
}
```

If I run it like so:

```
cargo +1.53 run --release -- ./worldcitiespop_mil.csv 10
```

...then according to `top | grep arrcsv` the RAM usage is something like 5 MB. If I increase `10` to `100,000`, the RAM usage goes to maybe 30 MB. Add another zero and the RAM usage is 255 MB.

Not being too familiar with Arrow, I would have expected:

1. A larger batch size might take more RAM while parsing, but give more efficient storage.
2. A small batch size would reduce RAM usage while parsing, but have more overhead (if it were 10% more I wouldn't be surprised).

However, the opposite seems to be true, and the usage seems oddly high and, mainly, unpredictable.

While making this minimal example, I had a thought that maybe the `arrow::csv::Reader` was still being kept around and it was using the memory, not the `Vec<RecordBatch>`, so I refactored the reading into a function that returns only the batches, after which the reader should have been dropped...

```rust
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

fn hmm() -> Vec<Result<RecordBatch, ArrowError>> {
    let args: Vec<String> = std::env::args().collect();
    let fname = &args[1];
    let batch_size: usize = args[2].parse().unwrap();

    let f = std::fs::File::open(&fname).unwrap();
    let reader = arrow::csv::ReaderBuilder::new()
        .infer_schema(Some(1_000))
        .has_header(true)
        .with_batch_size(batch_size)
        .build(f)
        .unwrap();

    // The reader is consumed and dropped here; only the batches are returned.
    reader.collect()
}

fn main() {
    let batches = hmm();

    let mut total = 0;
    for r in batches {
        total += r.unwrap().num_rows();
    }
    dbg!(total);

    // Uncomment to keep the process alive while inspecting RAM usage in `top`:
    // let mut input = String::new();
    // std::io::stdin().read_line(&mut input).unwrap();
}
```

...but even more surprisingly, the memory usage drastically increased. With this change:

- With a batch size of 1,000, the RAM usage is now about 80 MB (much higher than the ~5 MB before).
- With a batch size of 1,000,000, the RAM usage is slightly higher (255 MB -> 300 MB).
- With a very small batch size of 10, the RAM usage is about 630 MB?!

**To Reproduce**

1. Create an empty project with `main.rs` as one of my terrible lumps of code above. The only dependency is `arrow = "5.0.0"`.
2. Run the example with `cargo +1.53 run --release -- ./worldcitiespop_mil.csv 1000` etc.
3. Monitor RAM usage somehow. I was using the output from `top | grep ...`, hence the stdin-reading line in the code; there is also a sketch at the end for reading the RSS programmatically.

**Expected behavior**

Mostly covered above, but basically I'd expect the memory usage with all of these combinations to be quite similar.

**Additional context**

I've not used Arrow much, so it's very much possible I'm doing something strange or incorrect!

Versions of stuff:

- Linux (Debian Buster)
- arrow 4.2 with Rust 1.51
- Also: arrow 5.0 with Rust 1.53
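For what it's worth, one way to separate what Arrow itself has allocated from allocator overhead is to sum what each column reports and compare that figure against the RSS shown in `top`. Below is a minimal sketch, assuming `Array::get_array_memory_size()` is available in the arrow version in use:

```rust
use arrow::array::Array;

fn main() {
    let args: Vec<String> = std::env::args().collect();
    let f = std::fs::File::open(&args[1]).unwrap();
    let batch_size: usize = args[2].parse().unwrap();

    let reader = arrow::csv::ReaderBuilder::new()
        .infer_schema(Some(1_000))
        .has_header(true)
        .with_batch_size(batch_size)
        .build(f)
        .unwrap();

    let mut rows = 0;
    let mut arrow_bytes = 0;
    for batch in reader {
        let batch = batch.unwrap();
        rows += batch.num_rows();
        // Sum the bytes each column reports as allocated (buffers plus metadata).
        arrow_bytes += batch
            .columns()
            .iter()
            .map(|col| col.get_array_memory_size())
            .sum::<usize>();
    }
    // Compare `arrow_bytes` against the RSS reported by `top`.
    dbg!(rows, arrow_bytes);
}
```

If the reported byte count stays roughly constant across batch sizes while the RSS does not, the difference presumably comes from buffer over-allocation or allocator behaviour rather than from the data itself.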
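And since monitoring `top` by eye is fiddly, here is a small Linux-only sketch for reading the process's resident set size directly; `rss_kb` is a hypothetical helper written for this repro, not part of arrow:

```rust
use std::fs;

/// Read the current process's resident set size in kB from
/// /proc/self/status (Linux-only).
fn rss_kb() -> Option<u64> {
    let status = fs::read_to_string("/proc/self/status").ok()?;
    status
        .lines()
        .find(|line| line.starts_with("VmRSS:"))
        .and_then(|line| line.split_whitespace().nth(1))
        .and_then(|kb| kb.parse().ok())
}

fn main() {
    println!("RSS at start: {:?} kB", rss_kb());
    // ... run one of the repro snippets above ...
    println!("RSS at end:   {:?} kB", rss_kb());
}
```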
