yjshen commented on code in PR #9481:
URL: https://github.com/apache/arrow-datafusion/pull/9481#discussion_r1518816605
##########
datafusion/common/src/utils.rs:
##########
@@ -679,12 +679,32 @@ pub fn find_indices<T: PartialEq, S: Borrow<T>>(
.ok_or_else(|| DataFusionError::Execution("Target not found".to_string()))
}
+pub trait EffectiveSize {
+ fn get_effective_memory_size(&self) -> usize;
+}
+
+impl EffectiveSize for ArrayRef {
+ fn get_effective_memory_size(&self) -> usize {
+ self.to_data().get_slice_memory_size().unwrap_or(0)
+ }
+}
+
+impl EffectiveSize for RecordBatch {
+ fn get_effective_memory_size(&self) -> usize {
Review Comment:
Looking at the current codebase, I think we use slicing in several cases:
1. Slicing a large/monolithic batch into smaller batches based on the
batch_size config (agg output, file scan output, join output)
2. Partitioning an existing batch based on certain criteria (window/sort batch
partitioning)
3. Limit (is this the only case where only part of the batch is used?)
For cases 1 and 2, since all the sliced small batches are consumed by
subsequent operators, using the effective size would make the over-reporting
more reasonable, I suppose. The over-reporting would be reduced from roughly N
times to roughly 2 times the actual batch size?
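To make the arithmetic above concrete, here is a toy sketch in plain Rust
(no arrow dependency; `naive_report` and `effective_report` are illustrative
names, not DataFusion APIs). It assumes an original batch of B bytes sliced
into N equal pieces, with the original batch itself also tracked once:

```rust
/// Bytes charged to the memory pool if every slice reports the full
/// backing-buffer size (the current behavior being discussed).
fn naive_report(batch_bytes: usize, n_slices: usize) -> usize {
    n_slices * batch_bytes
}

/// Bytes charged if each slice reports only the bytes it actually covers
/// (the "effective size"), plus the original batch tracked once elsewhere.
fn effective_report(batch_bytes: usize, n_slices: usize) -> usize {
    n_slices * (batch_bytes / n_slices) + batch_bytes
}

fn main() {
    let (batch, n) = (1024, 8);
    // Naive accounting: N slices each charge the full buffer, so together
    // with the original batch we charge (N + 1) times the real memory.
    assert_eq!(naive_report(batch, n) + batch, 9 * batch);
    // Effective accounting: slices together charge ~B, original charges B,
    // so the over-report is bounded at ~2x regardless of N.
    assert_eq!(effective_report(batch, n), 2 * batch);
    println!(
        "naive={} effective={}",
        naive_report(batch, n) + batch,
        effective_report(batch, n)
    );
}
```

This is where the "from N times to 2 times" estimate comes from: the slices
collectively charge about one batch's worth of bytes instead of N.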
For case 3, you are correct that it would result in under-reporting, but
considering that a limit operation is most likely near the top of the DAG,
would memory usage be less of a concern there?
Another approach to avoid over-reporting could be adding a tag to
slice-generated batches: a sliced batch would report size 0, since the
original large batch should already be memory-tracked elsewhere.
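A minimal sketch of that tagging idea, in plain Rust with no arrow dependency
(`TrackedBatch`, `from_slice`, and `payload_bytes` are hypothetical names for
illustration, not part of DataFusion or arrow-rs):

```rust
/// Hypothetical wrapper that remembers whether a batch was produced by
/// slicing, so the memory pool is not charged twice for the same buffers.
#[derive(Debug)]
struct TrackedBatch {
    payload_bytes: usize,
    from_slice: bool,
}

impl TrackedBatch {
    /// A freshly materialized batch owns its buffers and reports them.
    fn new(payload_bytes: usize) -> Self {
        Self { payload_bytes, from_slice: false }
    }

    /// Slicing keeps the backing buffers alive but tags the result.
    fn slice(&self, len_bytes: usize) -> Self {
        Self { payload_bytes: len_bytes, from_slice: true }
    }

    /// Size reported to the memory pool: 0 for sliced batches, because
    /// the original batch is already tracked elsewhere.
    fn get_effective_memory_size(&self) -> usize {
        if self.from_slice { 0 } else { self.payload_bytes }
    }
}

fn main() {
    let original = TrackedBatch::new(1024);
    let sliced = original.slice(128);
    assert_eq!(original.get_effective_memory_size(), 1024);
    assert_eq!(sliced.get_effective_memory_size(), 0);
    println!("original={} slice={}",
        original.get_effective_memory_size(),
        sliced.get_effective_memory_size());
}
```

The trade-off is that the tag must survive however the batch travels through
operators, which is why it is only a possible alternative here.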
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]