Re: [PR] fix: overcounting of memory in first/last. [datafusion]

via GitHub Tue, 06 May 2025 17:05:14 -0700


chenkovsky commented on code in PR #15924:
URL: https://github.com/apache/datafusion/pull/15924#discussion_r2076561244



##########
datafusion/common/src/scalar/mod.rs:
##########
@@ -3415,6 +3415,100 @@ impl ScalarValue {
                 .map(|sv| sv.size() - size_of_val(sv))
                 .sum::<usize>()
     }
+
+    /// Compacts the allocation referenced by `self` to the minimum, copying 
the data if
+    /// necessary.
+    ///
+    /// This can be relevant when `self` is a list or contains a list as a 
nested value, as
+    /// a single list holds an Arc to its entire original array buffer.
+    pub fn compact(&mut self) {
+        match self {
+            ScalarValue::Null
+            | ScalarValue::Boolean(_)
+            | ScalarValue::Float16(_)
+            | ScalarValue::Float32(_)
+            | ScalarValue::Float64(_)
+            | ScalarValue::Decimal128(_, _, _)
+            | ScalarValue::Decimal256(_, _, _)
+            | ScalarValue::Int8(_)
+            | ScalarValue::Int16(_)
+            | ScalarValue::Int32(_)
+            | ScalarValue::Int64(_)
+            | ScalarValue::UInt8(_)
+            | ScalarValue::UInt16(_)
+            | ScalarValue::UInt32(_)
+            | ScalarValue::UInt64(_)
+            | ScalarValue::Date32(_)
+            | ScalarValue::Date64(_)
+            | ScalarValue::Time32Second(_)
+            | ScalarValue::Time32Millisecond(_)
+            | ScalarValue::Time64Microsecond(_)
+            | ScalarValue::Time64Nanosecond(_)
+            | ScalarValue::IntervalYearMonth(_)
+            | ScalarValue::IntervalDayTime(_)
+            | ScalarValue::IntervalMonthDayNano(_)
+            | ScalarValue::DurationSecond(_)
+            | ScalarValue::DurationMillisecond(_)
+            | ScalarValue::DurationMicrosecond(_)
+            | ScalarValue::DurationNanosecond(_)
+            | ScalarValue::Utf8(_)
+            | ScalarValue::LargeUtf8(_)
+            | ScalarValue::Utf8View(_)
+            | ScalarValue::TimestampSecond(_, _)
+            | ScalarValue::TimestampMillisecond(_, _)
+            | ScalarValue::TimestampMicrosecond(_, _)
+            | ScalarValue::TimestampNanosecond(_, _)
+            | ScalarValue::Binary(_)
+            | ScalarValue::FixedSizeBinary(_, _)
+            | ScalarValue::LargeBinary(_)
+            | ScalarValue::BinaryView(_) => (),
+            ScalarValue::FixedSizeList(arr) => {
+                let taken = arrow::compute::take(
+                    arr.as_ref(),
+                    &Int32Array::from_iter_values([0]),
+                    None,
+                )
+                .unwrap();
+                *Arc::make_mut(arr) = 
cast::as_fixed_size_list_array(&taken).clone();

Review Comment:
   @alamb  
   
   in take's comments
   ```
   Note that this kernel, similar to other kernels in this crate,
   will avoid allocating where not necessary. Consequently
   the returned array may share buffers with the inputs
   ```
   If a large buffer is shared, it's still a problem.
   so I didnt choose this method. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] fix: overcounting of memory in first/last. [datafusion]

Reply via email to