[GitHub] [arrow-rs] tustvold commented on a diff in pull request #3622: feat: take kernel for RunArray

via GitHub Mon, 30 Jan 2023 04:19:35 -0800


tustvold commented on code in PR #3622:
URL: https://github.com/apache/arrow-rs/pull/3622#discussion_r1090552517



##########
arrow-select/src/take.rs:
##########
@@ -810,6 +816,70 @@ where
     Ok(DictionaryArray::<T>::from(data))
 }
 
+macro_rules! primitive_run_take {
+    ($t:ty, $o:ty, $indices:ident, $value:ident) => {
+        take_primitive_run_values::<$o, $t>(
+            $indices,
+            as_primitive_array::<$t>($value.values()),
+        )
+    };
+}
+
+/// `take` implementation for run arrays
+///
+/// performs binary search on `run_ends` to get physical indices for the given logical indices.
+/// builds output run array by taking values in the input run array at the physical indices.
+/// for e.g. an input `RunArray{ run_ends = [2,4,6,8], values=[1,2,1,2] }` and `indices=[2,7]`
+/// would be converted to `physical_indices=[1,3]` which will be used to build
+/// output `RunArray{ run_ends=[2], values=[2] }`
+
+pub fn take_run<T, I>(
+    run_array: &RunArray<T>,
+    logical_indices: &PrimitiveArray<I>,
+) -> Result<RunArray<T>, ArrowError>
+where
+    T: RunEndIndexType,
+    T::Native: num::Num,
+    I: ArrowPrimitiveType,
+    I::Native: ToPrimitive,
+{
+    match run_array.data_type() {
+        DataType::RunEndEncoded(_, fl) => {
+            let physical_indices =
+                run_array.get_physical_indices(logical_indices.values())?;
+            downcast_primitive! {
+                fl.data_type() => (primitive_run_take, T, physical_indices, run_array),
+                dt => Err(ArrowError::NotYetImplemented(format!("take_run is not implemented for {dt:?}")))
+            }
+        }
+        dt => Err(ArrowError::InvalidArgumentError(format!(
+            "Expected DataType::RunEndEncoded found {dt:?}"
+        ))),
+    }
+}
+// Builds a `RunArray` by taking values from given array for the given indices.
+fn take_primitive_run_values<R, V>(

Review Comment:
   Hmm... You raise a good point, I need to think on this. Ignoring codegen for 
a moment, this will be very expensive for things like strings... 
   
   Generally the approach of the arrow kernels is to be fast at the expense of 
memory. Arrow is not a very compact representation regardless, and is best 
used only for intermediate results.
   
   This is similar to an issue we have with dictionaries (#3558), which I'm 
also currently mulling over 😅
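
   For reference, the logical-to-physical mapping described in the doc comment 
of `take_run` can be sketched outside arrow-rs as below. This is only an 
illustration under stated assumptions: `logical_to_physical` and `take_runs` 
are hypothetical helpers over plain `i32` slices, not the PR's code and not 
the arrow-rs API, and `values` is assumed to have the same length as 
`run_ends`.

```rust
/// Hypothetical helper (not part of arrow-rs): maps each logical index to the
/// physical index of the run that contains it, using binary search on `run_ends`.
/// Returns `None` if any logical index is negative or past the last run end.
fn logical_to_physical(run_ends: &[i32], logical: &[i32]) -> Option<Vec<usize>> {
    logical
        .iter()
        .map(|&l| {
            if l < 0 {
                return None;
            }
            // A run covers logical positions [previous_end, end), so the
            // containing run is the first one whose end is strictly greater
            // than the logical index.
            let p = run_ends.partition_point(|&end| end <= l);
            (p < run_ends.len()).then_some(p)
        })
        .collect()
}

/// Hypothetical helper (not part of arrow-rs): takes the values at the mapped
/// physical indices and re-encodes the result as (run_ends, values), merging
/// adjacent equal values into a single run.
/// Assumes `values.len() == run_ends.len()`.
fn take_runs(
    run_ends: &[i32],
    values: &[i32],
    logical: &[i32],
) -> Option<(Vec<i32>, Vec<i32>)> {
    let physical = logical_to_physical(run_ends, logical)?;
    let mut out_ends: Vec<i32> = Vec::new();
    let mut out_values: Vec<i32> = Vec::new();
    for (n, &p) in physical.iter().enumerate() {
        let v = values[p];
        let end = (n + 1) as i32;
        if out_values.last() == Some(&v) {
            // Same value as the previous output run: extend that run.
            *out_ends.last_mut().unwrap() = end;
        } else {
            // New value: start a new output run ending at this position.
            out_values.push(v);
            out_ends.push(end);
        }
    }
    Some((out_ends, out_values))
}
```

   With the doc comment's input, `run_ends = [2,4,6,8]`, `values = [1,2,1,2]` 
and `indices = [2,7]`, the mapping yields physical indices `[1,3]` and the 
re-encoded output is `run_ends = [2]`, `values = [2]`. The sketch pays one 
binary search per index and copies each taken value, which is cheap for 
primitives but is the part that becomes costly for values such as strings.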



