dbr commented on issue #623:
URL: https://github.com/apache/arrow-rs/issues/623#issuecomment-887945890


   > When you store them in a Vec instead of iterating over them (where they will be dropped) you'll keep them in memory
   
   Ahh, I think this is where most of my confusion was coming from - I should have had something after the `read_line` that re-iterated over the batches, to make sure they hadn't been dropped yet.
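   To make the distinction concrete, here's a minimal sketch of the two patterns (same `arrow::csv::Reader` as in the example below; `stream_rows`/`collect_rows` are just illustrative names):
   
   ```rust
   use arrow::csv::Reader;
   use arrow::error::ArrowError;
   use arrow::record_batch::RecordBatch;
   use std::fs::File;
   
   // Streaming: each batch is dropped at the end of its loop iteration,
   // so only roughly one batch's worth of arrays is alive at a time.
   fn stream_rows(reader: Reader<File>) -> Result<usize, ArrowError> {
       let mut total = 0;
       for batch in reader {
           total += batch?.num_rows();
       } // the batch from this iteration is freed here
       Ok(total)
   }
   
   // Collecting: every batch stays alive until `batches` is dropped,
   // so peak memory is the whole file's worth of arrays.
   fn collect_rows(reader: Reader<File>) -> Result<usize, ArrowError> {
       let batches: Vec<RecordBatch> = reader.collect::<Result<_, _>>()?;
       Ok(batches.iter().map(|b| b.num_rows()).sum())
   }
   ```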
   
   The only bit that remains a mystery to me is: why does a giant batch size cause the process to use so much RAM?
   
   With the tweaked example:
   
   ```rust
   use arrow::record_batch::RecordBatch;
   use arrow::error::ArrowError;
   
    // Read the whole CSV into record batches with the given batch size.
    fn hmm() -> Vec<Result<RecordBatch, ArrowError>> {
       let args: Vec<String> = std::env::args().collect();
       let fname = &args[1];
       let batch_size: usize = args[2].parse().unwrap();
   
       let f = std::fs::File::open(&fname).unwrap();
       let reader = arrow::csv::ReaderBuilder::new()
           .infer_schema(Some(5_000))
           .has_header(true)
           .with_batch_size(batch_size)
           .build(f).unwrap();
       reader.collect()
   }
   
   fn main() {
       let batches = hmm();
       let mut total = 0;
       let mut total_bytes = 0;
       for r in &batches {
           let batch = r.as_ref().unwrap();
           for c in batch.columns() {
               total_bytes += c.get_array_memory_size();
           }
           total += batch.num_rows();
       }
       dbg!(total);
       dbg!(total_bytes);
   
       // Delay to measure process RAM usage
       let mut input = String::new();
        let _ = std::io::stdin().read_line(&mut input);
   
        // Re-iterate over the batches so they stay alive past the read_line above
       for r in &batches {
           let batch = r.as_ref().unwrap();
           for c in batch.columns() {
               total_bytes += c.get_array_memory_size();
           }
           total += batch.num_rows();
       }
       dbg!(2, total);
   }
   ```
   
   ...I get the following results:
   
   batch size (rows) | process RAM | sum of `get_array_memory_size` (bytes)
   ------------------|-------------|----------------------------------------
   1,000,000  | 357.7m      | 96,186,112
   500,000    | 232.1m      | 96,187,648
   50,000     | 93.4m       | 83,143,168
   5,000      | 93.8m       | 97,396,736
   500        | 124.3m      | 102,904,832
   
   The size reported by `get_array_memory_size` matches what you say exactly - an overly small batch size starts to introduce some overhead from the duplicated array structs, references and so on - but the difference is pretty small (it varies by less than 10%, which seems perfectly reasonable).
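   One way to pin down how much of that is per-array overhead versus actual buffer bytes - assuming the version in use also exposes `get_buffer_memory_size` on `Array`, which I haven't double-checked - would be something like:
   
   ```rust
   use arrow::error::ArrowError;
   use arrow::record_batch::RecordBatch;
   
   // Sum both per-column size reports: buffer bytes (the actual data)
   // and total array memory (buffers plus per-array structs/metadata).
   // The gap between the two is the per-array overhead that grows as
   // the batch size shrinks.
   fn size_breakdown(batches: &[Result<RecordBatch, ArrowError>]) -> (usize, usize) {
       let mut buffer_bytes = 0;
       let mut array_bytes = 0;
       for r in batches {
           let batch = r.as_ref().unwrap();
           for c in batch.columns() {
               buffer_bytes += c.get_buffer_memory_size();
               array_bytes += c.get_array_memory_size();
           }
       }
       (buffer_bytes, array_bytes)
   }
   ```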
   
   However, the process RAM seems to do the inverse of what I'd expect - it's as though something is leaking from the parser, or an array is being over-allocated, or something like that?
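   To check whether the extra process RAM is actually bytes the program still holds (e.g. over-sized buffers kept alive by the arrays) rather than pages the allocator simply hasn't returned to the OS, I could wrap the global allocator with a byte counter. A rough sketch, assuming all of Arrow's buffer allocations go through the global allocator:
   
   ```rust
   use std::alloc::{GlobalAlloc, Layout, System};
   use std::sync::atomic::{AtomicUsize, Ordering};
   
   // Tracks live heap bytes handed out by the allocator, so they can be
   // compared against both the process RSS and the get_array_memory_size
   // totals after the batches are collected.
   struct CountingAlloc;
   
   static LIVE_BYTES: AtomicUsize = AtomicUsize::new(0);
   
   unsafe impl GlobalAlloc for CountingAlloc {
       unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
           let p = System.alloc(layout);
           if !p.is_null() {
               LIVE_BYTES.fetch_add(layout.size(), Ordering::Relaxed);
           }
           p
       }
       unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
           System.dealloc(ptr, layout);
           LIVE_BYTES.fetch_sub(layout.size(), Ordering::Relaxed);
       }
   }
   
   #[global_allocator]
   static ALLOC: CountingAlloc = CountingAlloc;
   
   // Call this next to the dbg! lines in main to snapshot live heap usage.
   fn live_heap_bytes() -> usize {
       LIVE_BYTES.load(Ordering::Relaxed)
   }
   ```
   
   If `live_heap_bytes()` tracks the `get_array_memory_size` total while RSS stays high, the difference is allocator/OS behaviour rather than Arrow holding onto extra memory; if it tracks RSS instead, something really is over-allocating.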

