yjshen commented on code in PR #9481:
URL: https://github.com/apache/arrow-datafusion/pull/9481#discussion_r1518816605


##########
datafusion/common/src/utils.rs:
##########
@@ -679,12 +679,32 @@ pub fn find_indices<T: PartialEq, S: Borrow<T>>(
         .ok_or_else(|| DataFusionError::Execution("Target not found".to_string()))
 }
 
+pub trait EffectiveSize {
+    fn get_effective_memory_size(&self) -> usize;
+}
+
+impl EffectiveSize for ArrayRef {
+    fn get_effective_memory_size(&self) -> usize {
+        self.to_data().get_slice_memory_size().unwrap_or(0)
+    }
+}
+
+impl EffectiveSize for RecordBatch {
+    fn get_effective_memory_size(&self) -> usize {

Review Comment:
   Looking at the current codebase, I think we use slicing in several cases:
   1. Slicing a large/mono batch into smaller batches based on the `batch_size` config (agg output, file scan output, join output).
   2. Partitioning an existing batch based on certain criteria (window/sort batch partitioning).
   3. Limit (is this the only case where only part of the batch is used?)
   
   For cases 1 and 2, since all of the sliced small batches will be used by subsequent operators, using the effective size would make the overreporting more reasonable, I suppose. The overreporting would be reduced from N times the actual batch size to at most 2 times?
   
   For case 3, you are correct that it will result in underreporting, but considering that a limit operation would most likely sit near the top of the DAG, would memory usage be less of a concern there?
   
   Another approach to avoid overreporting could be to add a tag to slice-generated batches: a sliced batch would report size 0, since the original large batch should already be memory-tracked elsewhere.
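   A minimal sketch of that tagging idea, with entirely made-up names (`TrackedBatch` and its methods are not DataFusion API; a real version would live on `RecordBatch` or a wrapper around it):

   ```rust
   // Hypothetical batch wrapper that remembers whether it was produced by
   // slicing, so a slice-derived batch never double-counts the shared buffer.
   struct TrackedBatch {
       bytes: usize,
       from_slice: bool,
   }

   impl TrackedBatch {
       fn new(bytes: usize) -> Self {
           Self { bytes, from_slice: false }
       }

       // Slicing is zero-copy; the child carries a tag marking its origin.
       fn slice(&self, len: usize) -> Self {
           Self { bytes: len, from_slice: true }
       }

       // Report 0 for sliced batches: the parent is already tracked elsewhere.
       fn reported_memory_size(&self) -> usize {
           if self.from_slice { 0 } else { self.bytes }
       }
   }

   fn main() {
       let parent = TrackedBatch::new(1024);
       let children: Vec<TrackedBatch> = (0..8).map(|_| parent.slice(128)).collect();
       let total: usize = parent.reported_memory_size()
           + children.iter().map(|c| c.reported_memory_size()).sum::<usize>();
       println!("total reported: {total}"); // 1024 -- only the parent is counted
   }
   ```

   The trade-off is that the tag must be propagated through operators that rebuild batches, otherwise a rebuilt (copied) batch would wrongly report 0.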


