ding-young commented on code in PR #15700: URL: https://github.com/apache/datafusion/pull/15700#discussion_r2218105116
########## datafusion/physical-plan/src/spill/get_size.rs: ########## @@ -0,0 +1,216 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use arrow::array::{ + Array, ArrayRef, ArrowPrimitiveType, BooleanArray, FixedSizeBinaryArray, + FixedSizeListArray, GenericByteArray, GenericListArray, OffsetSizeTrait, + PrimitiveArray, RecordBatch, StructArray, +}; +use arrow::buffer::{BooleanBuffer, Buffer, NullBuffer, OffsetBuffer}; +use arrow::datatypes::{ArrowNativeType, ByteArrayType}; +use arrow::downcast_primitive_array; +use arrow_schema::DataType; + +/// TODO - NEED TO MOVE THIS TO ARROW

Review Comment:

I've also experienced a case where using `batch.get_array_memory_size` leads to overestimation in aggregation. But that's not because arrow over-allocates buffers; it's because we **slice** the batch when we spill in `spill_record_batch_by_size_(and_return_max_batch_memory)`. Slicing a batch does not change the buffer capacity that `batch.get_array_memory_size` relies on, so each sliced batch reports the full buffer memory usage even though the buffers are shared between the slices.
When running `test_single_mode_aggregate_single_mode_aggregate_with_spill()` in `aggregation_fuzz` with the [reproducer branch](https://github.com/ding-young/datafusion/tree/batch-memory-reproducer), we can verify that naively relying on `batch.get_array_memory_size` with sliced batches can lead to overestimation:

```
---- fuzz_cases::aggregate_fuzz::test_single_mode_aggregate_single_mode_aggregate_with_spill stdout ----
batch.get_actually_used_size:21492, batch.get_array_memory_size:178946
batch.get_actually_used_size:21624, batch.get_array_memory_size:178946
batch.get_actually_used_size:21635, batch.get_array_memory_size:178946
batch.get_actually_used_size:21620, batch.get_array_memory_size:178946
batch.get_actually_used_size:21636, batch.get_array_memory_size:178946
...
```

Since sort does not use the `spill_record_batch_by_size` API, no slicing is involved, so (I believe) only aggregation suffered from this issue. If slicing was the only cause of the overestimation, I think we can just add special-case handling for aggregation and keep things simple with the existing `batch.get_array_memory_size` API. I wonder if the overestimation @rluvaton encountered was caused by this issue, or if it was due to something else.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org