Jefffrey commented on code in PR #9036:
URL: https://github.com/apache/arrow-rs/pull/9036#discussion_r2647727740
##########
arrow-select/src/filter.rs:
##########
@@ -495,14 +495,21 @@ where
let filter_values = predicate.filter.values();
let run_ends = run_ends.inner();
+ let offset = array.offset() as u64;
+ let slice_end = (offset + array.len() as u64).min(offset +
filter_values.len() as u64);
+
let pred: BooleanArray = BooleanBuffer::collect_bool(run_ends.len(), |i| {
let mut keep = false;
- let mut end = run_ends[i].into() as u64;
- let difference = end.saturating_sub(filter_values.len() as u64);
- end -= difference;
+ let end = run_ends[i].into() as u64;
+
+ let loop_start = start.max(offset);
+ let loop_end = end.min(slice_end);
Review Comment:
This min/max logic doesn't seem intuitive to me; from what I understand it
is because we still iterate through the unsliced parts of the runarray, but
this min/max guards us to only check the sliced region. Am I correct in my
understanding here?
If so I wonder can we refactor this to only consider the sliced region from
the outset?
##########
arrow-array/src/array/run_array.rs:
##########
@@ -1128,4 +1138,35 @@ mod tests {
assert_eq!(array_i16_1, array_i16_2);
}
+
+ #[test]
+ fn test_run_array_values_slice() {
+ // A, A, B, B, B, C...C (15 Cs)
+ let run_ends: PrimitiveArray<Int32Type> = vec![2, 5, 20].into();
+ let values: PrimitiveArray<Int32Type> = vec![0, 1, 2].into();
Review Comment:
It's confusing to have the comments refer to values as A,B,C but the values
are instead 0,1,2
Maybe it's better to have the values as an actual string array to mirror the
comments
##########
arrow-buffer/src/buffer/run.rs:
##########
@@ -189,6 +189,16 @@ where
&self.run_ends
}
+ /// Returns the slice of the run-ends array that corresponds to the
+ /// physical indices of this run-end buffer.
+ ///
+ /// This is equivalent to
`&self.values()[self.get_start_physical_index()..self.get_end_physical_index()
+ 1]`
Review Comment:
```suggestion
/// Similar to [`values`] but accounts for logical slicing, returning
only the values
/// that are part of the logical slice of this buffer.
///
/// [`values`]: Self::values
```
Same here
##########
arrow-select/src/filter.rs:
##########
@@ -495,14 +495,21 @@ where
let filter_values = predicate.filter.values();
let run_ends = run_ends.inner();
+ let offset = array.offset() as u64;
+ let slice_end = (offset + array.len() as u64).min(offset +
filter_values.len() as u64);
Review Comment:
Is this min strictly necessary? I don't think it should be possible for
`filter_values` length to exceed our RunArray length
##########
arrow-select/src/concat.rs:
##########
@@ -377,20 +381,30 @@ where
PrimitiveArray::<R>::from_iter_values(run_arrays.iter().enumerate().flat_map(
move |(i, run_array)| {
let adjustment = needed_run_end_adjustments[i];
+ let offset =
R::Native::from_usize(run_array.offset()).unwrap();
+ let length = R::Native::from_usize(run_array.len()).unwrap();
+ let start = run_array.get_start_physical_index();
+ let len = run_array.get_end_physical_index() - start + 1;
+
run_array
.run_ends()
.values()
.iter()
- .map(move |run_end| *run_end + adjustment)
+ .skip(start)
+ .take(len)
Review Comment:
Couldn't we use the new method that was introduced to return only the sliced
values here?
##########
arrow-array/src/array/run_array.rs:
##########
@@ -136,6 +136,16 @@ impl<R: RunEndIndexType> RunArray<R> {
&self.values
}
+ /// Returns a slice of the values array that corresponds to the
+ /// physical indices of this run-end encoded array.
+ ///
+ /// This is equivalent to
`self.values().slice(self.get_start_physical_index(),
self.get_end_physical_index() - self.get_start_physical_index() + 1)`
Review Comment:
```suggestion
/// Similar to [`values`] but accounts for logical slicing, returning
only the values
/// that are part of the logical slice of this array.
///
/// [`values`]: Self::values
```
No need to include details on what it is equivalent to; people can simply
check the source code for that
##########
arrow-buffer/src/buffer/run.rs:
##########
@@ -189,6 +189,16 @@ where
&self.run_ends
}
+ /// Returns the slice of the run-ends array that corresponds to the
+ /// physical indices of this run-end buffer.
+ ///
+ /// This is equivalent to
`&self.values()[self.get_start_physical_index()..self.get_end_physical_index()
+ 1]`
+ pub fn run_ends_slice(&self) -> &[E] {
Review Comment:
It's confusing to name this `run_ends_slice` considering the other method
here is named `values`
##########
arrow-select/src/concat.rs:
##########
@@ -377,20 +381,30 @@ where
PrimitiveArray::<R>::from_iter_values(run_arrays.iter().enumerate().flat_map(
move |(i, run_array)| {
let adjustment = needed_run_end_adjustments[i];
+ let offset =
R::Native::from_usize(run_array.offset()).unwrap();
+ let length = R::Native::from_usize(run_array.len()).unwrap();
+ let start = run_array.get_start_physical_index();
+ let len = run_array.get_end_physical_index() - start + 1;
+
run_array
.run_ends()
.values()
.iter()
- .map(move |run_end| *run_end + adjustment)
+ .skip(start)
+ .take(len)
+ .map(move |run_end| {
+ let value = *run_end - offset;
+ value.min(length) + adjustment
Review Comment:
Why is this min necessary?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]