alamb opened a new issue, #4645:
URL: https://github.com/apache/arrow-rs/issues/4645

   **Describe the bug**
   When merging GBs of high-cardinality dictionary data, the size reported by 
the RowInterner is a significant (GB) undercount.
   
   This causes our system to significantly exceed its configured memory limit 
in several cases. 
   
   I believe the bug is that `Bucket::size()` does not account for the size of 
the embedded `Bucket` in `Slot`. I will make a PR shortly.
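
To illustrate the suspected undercount, here is a minimal sketch. The `Slot` and `Bucket` types below are hypothetical, simplified stand-ins (payload omitted), not the actual arrow-rs definitions, and `shallow_size`, `deep_size`, and `nested` are names invented for this example. It only shows how a size method that skips heap-allocated child buckets falls behind one that counts them:

```rust
use std::mem::size_of;

// Hypothetical, simplified stand-ins for the interner's internal types.
// The real arrow-rs definitions differ.
struct Slot {
    // A slot may own a nested bucket allocated on the heap.
    child: Option<Box<Bucket>>,
}

struct Bucket {
    slots: Vec<Slot>,
}

impl Bucket {
    /// Counts only this bucket's slot array, ignoring nested buckets.
    fn shallow_size(&self) -> usize {
        self.slots.capacity() * size_of::<Slot>()
    }

    /// Also counts each boxed child `Bucket` and, recursively, its contents.
    fn deep_size(&self) -> usize {
        let children: usize = self
            .slots
            .iter()
            .filter_map(|slot| slot.child.as_ref())
            .map(|child| size_of::<Bucket>() + child.deep_size())
            .sum();
        self.shallow_size() + children
    }
}

/// Builds a chain of `depth` nested buckets below a root, one slot each.
fn nested(depth: usize) -> Bucket {
    let child = if depth > 0 {
        Some(Box::new(nested(depth - 1)))
    } else {
        None
    };
    Bucket { slots: vec![Slot { child }] }
}

fn main() {
    let root = nested(1000);
    println!(
        "shallow: {} bytes, deep: {} bytes",
        root.shallow_size(),
        root.deep_size()
    );
}
```

In this sketch the gap between the two numbers grows with the number of nested buckets, which would be consistent with a large undercount on high-cardinality data.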
   
   **To Reproduce**
   I can reproduce this when merging GBs of high-cardinality, proprietary, 
dictionary-encoded data.
   
   I tried to write a unit test but could not figure out how. Any thoughts 
would be appreciated.
   
   ```
       #[test]
       fn test_intern_sizes() {
           let mut interner = OrderPreservingInterner::default();
   
           // Intern `num_items` values of `size_of::<usize>()` bytes each; the
           // interner should report at least that many bytes
           // ...
           let num_items = 3000;
           let mut values: Vec<usize> = (0..num_items).collect();
           values.reverse();
   
           interner.intern(values.iter().map(|v| Some(v.to_be_bytes())));
           let actual_size = interner.size();
           let min_expected_size =
               // at least space for each item
               num_items * std::mem::size_of::<usize>()
               // at least one slot for each item
               + num_items * std::mem::size_of::<Slot>();
   
           println!("Actual size: {actual_size}, min {min_expected_size}");
   
           assert!(actual_size > min_expected_size,
                   "actual_size {actual_size} not larger than min_expected_size: {min_expected_size}");
       }
   
   ```
   
   **Expected behavior**
   The size reported by the interner should account for all of the memory it 
holds, including the `Bucket` embedded in each `Slot`.
   
   **Additional context**
   I found this while testing 
https://github.com/apache/arrow-datafusion/pull/7130 with our internal data -- 
it did not reduce memory requirements the way I expected. I tracked the root 
cause down to this undercount.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
