asubiotto commented on code in PR #9856:
URL: https://github.com/apache/arrow-rs/pull/9856#discussion_r3188818551
##########
arrow-select/src/interleave.rs:
##########
@@ -411,6 +417,70 @@ fn interleave_list<O: OffsetSizeTrait>(
Ok(Arc::new(list_array))
}
+/// Specialized [`interleave`] for [`RunArray`].
+fn interleave_run_end<R: RunEndIndexType>(
+ values: &[&dyn Array],
+ indices: &[(usize, usize)],
+) -> Result<ArrayRef, ArrowError> {
+ if indices.is_empty() {
+ return Ok(new_empty_array(values[0].data_type()));
+ }
+
+ let n = indices.len();
+ R::Native::from_usize(n).ok_or_else(|| {
+ ArrowError::ComputeError(format!(
+ "interleave_run_end: output length {n} does not fit run-end type"
+ ))
+ })?;
+
+ let runs: Vec<&RunArray<R>> = values.iter().map(|a|
a.as_run::<R>()).collect();
+ let value_arrays: Vec<&dyn Array> = runs.iter().map(|r|
r.values().as_ref()).collect();
+
+ // Resolve each (array, logical_row) to (array, physical_row), so we can
+ // lookup physical indices by batch.
+ let mut phys_pairs: Vec<(usize, usize)> = vec![(0, 0); n];
+ let mut grouped: Vec<(Vec<R::Native>, Vec<usize>)> =
+ (0..runs.len()).map(|_| (Vec::new(), Vec::new())).collect();
+ for (out_pos, &(arr, row)) in indices.iter().enumerate() {
+ let row = R::Native::from_usize(row).ok_or_else(|| {
+ ArrowError::InvalidArgumentError(format!(
+ "interleave_run_end: row index {row} out of range"
+ ))
+ })?;
+ grouped[arr].0.push(row);
+ grouped[arr].1.push(out_pos);
+ }
+ for (arr_idx, (logical_rows, out_positions)) in
grouped.into_iter().enumerate() {
+ let phys = runs[arr_idx].get_physical_indices(&logical_rows)?;
+ for (p, out_pos) in phys.iter().zip(out_positions.iter()) {
+ phys_pairs[*out_pos] = (arr_idx, *p);
+ }
+ }
+
+ // Coalesce by physical-pair equality only: emit a new run when the
+ // (array_idx, physical_idx) pair changes between adjacent output rows.
+ // TODO: We could perform an equality check across sources to extend the
Review Comment:
Yes, exactly. That PR would make sense in this block so we don't compact in
the interleave fallback. This also means that the equality cost is only paid
when interleave pairs select from different input run arrays (assumption is
input run arrays are well formed). I'm concerned about the per-row slicing cost
though. I think ideally you would have a cache of comparators but I believe
that require some crate readjusting.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]