dbr commented on issue #623:
URL: https://github.com/apache/arrow-rs/issues/623#issuecomment-887945890
> When you store them in a Vec instead of iterating over them (where they will be dropped) you'll keep them in memory
Ahh, I think this is where the majority of my confusion was coming from: I
should have had something after the `read_line` that re-iterated over the
batches, to be sure they hadn't been dropped yet.
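To illustrate the drop-vs-keep distinction (this is just a sketch with plain `String`s standing in for record batches, not the arrow API):

```rust
fn main() {
    // Plain Strings stand in for record batches here, purely for illustration.
    let make = || -> Vec<String> { (0..3).map(|i| format!("batch {i}")).collect() };

    // Consuming the Vec by value: each element is dropped at the end of its
    // loop iteration, so nothing remains resident afterwards.
    for s in make() {
        println!("processing {s}"); // `s` is dropped here
    }

    // Borrowing keeps every element alive until `kept` goes out of scope,
    // so all of them are still in memory at any point after this loop.
    let kept = make();
    for s in &kept {
        println!("revisiting {s}");
    }
    println!("still holding {} items", kept.len()); // prints "still holding 3 items"
}
```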
The only bit that remains a mystery to me is: why does a giant batch size
cause the process to use so much RAM?
With the tweaked example:
```rust
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

fn hmm() -> Vec<Result<RecordBatch, ArrowError>> {
    let args: Vec<String> = std::env::args().collect();
    let fname = &args[1];
    let batch_size: usize = args[2].parse().unwrap();
    let f = std::fs::File::open(fname).unwrap();
    let reader = arrow::csv::ReaderBuilder::new()
        .infer_schema(Some(5_000))
        .has_header(true)
        .with_batch_size(batch_size)
        .build(f)
        .unwrap();
    reader.collect()
}

fn main() {
    let batches = hmm();
    let mut total = 0;
    let mut total_bytes = 0;
    for r in &batches {
        let batch = r.as_ref().unwrap();
        for c in batch.columns() {
            total_bytes += c.get_array_memory_size();
        }
        total += batch.num_rows();
    }
    dbg!(total);
    dbg!(total_bytes);

    // Delay so the process RAM usage can be measured externally
    let mut input = String::new();
    let _ = std::io::stdin().read_line(&mut input);

    // Re-iterate so `batches` is provably still alive at this point
    for r in &batches {
        let batch = r.as_ref().unwrap();
        for c in batch.columns() {
            total_bytes += c.get_array_memory_size();
        }
        total += batch.num_rows();
    }
    dbg!(2, total);
}
```
...I get the following results:
batch size | process RAM | sum of get_array_memory_size
-----------|-------------|-----------------------------
1,000,000 | 357.7m | 96,186,112
500,000 | 232.1m | 96,187,648
50,000 | 93.4m | 83,143,168
5,000 | 93.8m | 97,396,736
500 | 124.3m | 102,904,832
The size reported by `get_array_memory_size` matches what you describe: an
overly small batch size starts to introduce some overhead from the duplicated
references and so on, but the difference is pretty small (it varies by under
10%, which seems perfectly reasonable).
The process RAM, however, does the inverse of what I'd expect; it's as if
something is leaking from the parser, or an array is being over-allocated,
or something like that?
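On the over-allocation theory, one plausible mechanism (I haven't confirmed this is what arrow's CSV reader does) is builders reserving capacity for a whole batch up front, so the process footprint tracks capacity rather than the number of rows actually written. A plain-`Vec` sketch of the effect:

```rust
fn main() {
    // A builder that reserves room for a whole batch up front holds
    // capacity * element_size bytes, regardless of how many rows arrive.
    let mut buf: Vec<u64> = Vec::with_capacity(1_000_000);
    buf.extend(0..10u64);

    let reserved = buf.capacity() * std::mem::size_of::<u64>();
    println!(
        "len = {}, capacity = {}, ~{} bytes reserved",
        buf.len(),
        buf.capacity(),
        reserved
    );

    // shrink_to_fit hands the unused tail back to the allocator.
    buf.shrink_to_fit();
    println!("after shrink: capacity = {}", buf.capacity());
}
```

If something like this is happening per column, a batch size of 1,000,000 against a file with fewer (or unevenly filled) rows would explain RSS far above the sum of `get_array_memory_size`.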
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]