[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #4989: Add support for linear range calculation in WINDOW functions

GitBox Fri, 20 Jan 2023 05:54:39 -0800


alamb commented on code in PR #4989:
URL: https://github.com/apache/arrow-datafusion/pull/4989#discussion_r1082548503



##########
datafusion/common/src/utils.rs:
##########
@@ -22,8 +22,16 @@ use arrow::array::ArrayRef;
 use arrow::compute::SortOptions;
 use std::cmp::Ordering;
 
+/// Given column vectors, returns row at `idx`.
+pub fn get_row_at_idx(columns: &[ArrayRef], idx: usize) -> 
Result<Vec<ScalarValue>> {
+    columns
+        .iter()
+        .map(|arr| ScalarValue::try_from_array(arr, idx))
+        .collect()
+}
+
 /// This function compares two tuples depending on the given sort options.
-fn compare(
+pub fn compare_rows(

Review Comment:
   maybe `pub crate` would be better here unless there is some reason to export 
it



##########
datafusion/common/src/utils.rs:
##########
@@ -103,6 +111,53 @@ where
     Ok(low)
 }
 
+/// This function searches for a tuple of given values (`target`) among the 
given
+/// rows (`item_columns`) via a linear scan. It assumes that `item_columns` is 
sorted
+/// according to `sort_options` and returns the insertion index of `target`.
+/// Template argument `SIDE` being `true`/`false` means left/right insertion.
+pub fn linear_search<const SIDE: bool>(
+    item_columns: &[ArrayRef],
+    target: &[ScalarValue],
+    sort_options: &[SortOptions],
+) -> Result<usize> {
+    let low: usize = 0;
+    let high: usize = item_columns
+        .get(0)
+        .ok_or_else(|| {
+            DataFusionError::Internal("Column array shouldn't be 
empty".to_string())
+        })?
+        .len();
+    let compare_fn = |current: &[ScalarValue], target: &[ScalarValue]| {

Review Comment:
   FWIW the arrow kernels have  
https://docs.rs/arrow/31.0.0/arrow/compute/struct.LexicographicalComparator.html
 which you might be able to use and avoid having to construct ScalarValues to 
compare.



##########
datafusion/common/src/utils.rs:
##########
@@ -103,6 +111,53 @@ where
     Ok(low)
 }
 
+/// This function searches for a tuple of given values (`target`) among the 
given
+/// rows (`item_columns`) via a linear scan. It assumes that `item_columns` is 
sorted
+/// according to `sort_options` and returns the insertion index of `target`.
+/// Template argument `SIDE` being `true`/`false` means left/right insertion.
+pub fn linear_search<const SIDE: bool>(
+    item_columns: &[ArrayRef],
+    target: &[ScalarValue],
+    sort_options: &[SortOptions],
+) -> Result<usize> {
+    let low: usize = 0;
+    let high: usize = item_columns
+        .get(0)
+        .ok_or_else(|| {
+            DataFusionError::Internal("Column array shouldn't be 
empty".to_string())
+        })?
+        .len();
+    let compare_fn = |current: &[ScalarValue], target: &[ScalarValue]| {
+        let cmp = compare_rows(current, target, sort_options)?;
+        Ok(if SIDE { cmp.is_lt() } else { cmp.is_le() })
+    };
+    search_in_slice(item_columns, target, compare_fn, low, high)
+}
+
+/// This function searches for a tuple of given values (`target`) among a 
slice of
+/// the given rows (`item_columns`) via a linear scan. The slice starts at the 
index
+/// `low` and ends at the index `high`. The boolean-valued function 
`compare_fn`
+/// specifies the stopping criterion.
+pub fn search_in_slice<F>(
+    item_columns: &[ArrayRef],
+    target: &[ScalarValue],
+    compare_fn: F,
+    mut low: usize,
+    high: usize,
+) -> Result<usize>
+where
+    F: Fn(&[ScalarValue], &[ScalarValue]) -> Result<bool>,
+{
+    while low < high {

Review Comment:
   I don't know if this would be faster or not, but I winder if you might be 
able to use `.iter()` and `find()` to find the relevant index and possibly 
avoid the bounds checks 🤔 
   
   Like if you could do something like 
`item_columns.iter().enumerate().find(compare_fn).map(|(i, _)| i).unwrap()`
   



##########
datafusion/common/src/utils.rs:
##########
@@ -22,8 +22,16 @@ use arrow::array::ArrayRef;
 use arrow::compute::SortOptions;
 use std::cmp::Ordering;
 
+/// Given column vectors, returns row at `idx`.
+pub fn get_row_at_idx(columns: &[ArrayRef], idx: usize) -> 
Result<Vec<ScalarValue>> {
+    columns
+        .iter()
+        .map(|arr| ScalarValue::try_from_array(arr, idx))

Review Comment:
   Longer term, I think there is quite a bit more performance to be gained by 
avoiding the use of `ScalarValue` here and instead using something more 
optimized like the RowFormat. This is not something for this PR I am just 
trying to plan a seed if we need more performance, there is a path



##########
datafusion/common/src/utils.rs:
##########
@@ -103,6 +111,53 @@ where
     Ok(low)
 }
 
+/// This function searches for a tuple of given values (`target`) among the 
given
+/// rows (`item_columns`) via a linear scan. It assumes that `item_columns` is 
sorted
+/// according to `sort_options` and returns the insertion index of `target`.
+/// Template argument `SIDE` being `true`/`false` means left/right insertion.
+pub fn linear_search<const SIDE: bool>(

Review Comment:
   I didn't see `linear_search` used anywhere in this PR other than tests  -- I 
wonder if the intention was to call it in 
`datafusion/physical-expr/src/window/window_frame_state.rs` ? It looks like 
that currently has an inlined version of this function 🤔 



##########
datafusion/physical-expr/src/window/partition_evaluator.rs:
##########
@@ -83,7 +83,7 @@ pub trait PartitionEvaluator: Debug + Send {
     fn evaluate_inside_range(
         &self,
         _values: &[ArrayRef],
-        _range: Range<usize>,
+        _range: &Range<usize>,

Review Comment:
   👍 



##########
datafusion/physical-expr/src/window/window_frame_state.rs:
##########
@@ -285,8 +301,16 @@ impl WindowFrameStateRange {
         } else {
             current_row_values
         };
-        // `BISECT_SIDE` true means bisect_left, false means bisect_right
-        bisect::<BISECT_SIDE>(range_columns, &end_range, sort_options)
+        let search_start = if SIDE {
+            last_range.start
+        } else {
+            last_range.end
+        };
+        let compare_fn = |current: &[ScalarValue], target: &[ScalarValue]| {

Review Comment:
   See the above comment about 
https://docs.rs/arrow/31.0.0/arrow/compute/struct.LexicographicalComparator.html
 possibly being another way to improve performance



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #4989: Add support for linear range calculation in WINDOW functions

Reply via email to