alamb commented on code in PR #4989:
URL: https://github.com/apache/arrow-datafusion/pull/4989#discussion_r1082548503
##########
datafusion/common/src/utils.rs:
##########
@@ -22,8 +22,16 @@ use arrow::array::ArrayRef;
use arrow::compute::SortOptions;
use std::cmp::Ordering;
+/// Given column vectors, returns row at `idx`.
+pub fn get_row_at_idx(columns: &[ArrayRef], idx: usize) ->
Result<Vec<ScalarValue>> {
+ columns
+ .iter()
+ .map(|arr| ScalarValue::try_from_array(arr, idx))
+ .collect()
+}
+
/// This function compares two tuples depending on the given sort options.
-fn compare(
+pub fn compare_rows(
Review Comment:
maybe `pub crate` would be better here unless there is some reason to export
it
##########
datafusion/common/src/utils.rs:
##########
@@ -103,6 +111,53 @@ where
Ok(low)
}
+/// This function searches for a tuple of given values (`target`) among the
given
+/// rows (`item_columns`) via a linear scan. It assumes that `item_columns` is
sorted
+/// according to `sort_options` and returns the insertion index of `target`.
+/// Template argument `SIDE` being `true`/`false` means left/right insertion.
+pub fn linear_search<const SIDE: bool>(
+ item_columns: &[ArrayRef],
+ target: &[ScalarValue],
+ sort_options: &[SortOptions],
+) -> Result<usize> {
+ let low: usize = 0;
+ let high: usize = item_columns
+ .get(0)
+ .ok_or_else(|| {
+ DataFusionError::Internal("Column array shouldn't be
empty".to_string())
+ })?
+ .len();
+ let compare_fn = |current: &[ScalarValue], target: &[ScalarValue]| {
Review Comment:
FWIW the arrow kernels have
https://docs.rs/arrow/31.0.0/arrow/compute/struct.LexicographicalComparator.html
which you might be able to use and avoid having to construct ScalarValues to
compare.
##########
datafusion/common/src/utils.rs:
##########
@@ -103,6 +111,53 @@ where
Ok(low)
}
+/// This function searches for a tuple of given values (`target`) among the
given
+/// rows (`item_columns`) via a linear scan. It assumes that `item_columns` is
sorted
+/// according to `sort_options` and returns the insertion index of `target`.
+/// Template argument `SIDE` being `true`/`false` means left/right insertion.
+pub fn linear_search<const SIDE: bool>(
+ item_columns: &[ArrayRef],
+ target: &[ScalarValue],
+ sort_options: &[SortOptions],
+) -> Result<usize> {
+ let low: usize = 0;
+ let high: usize = item_columns
+ .get(0)
+ .ok_or_else(|| {
+ DataFusionError::Internal("Column array shouldn't be
empty".to_string())
+ })?
+ .len();
+ let compare_fn = |current: &[ScalarValue], target: &[ScalarValue]| {
+ let cmp = compare_rows(current, target, sort_options)?;
+ Ok(if SIDE { cmp.is_lt() } else { cmp.is_le() })
+ };
+ search_in_slice(item_columns, target, compare_fn, low, high)
+}
+
+/// This function searches for a tuple of given values (`target`) among a
slice of
+/// the given rows (`item_columns`) via a linear scan. The slice starts at the
index
+/// `low` and ends at the index `high`. The boolean-valued function
`compare_fn`
+/// specifies the stopping criterion.
+pub fn search_in_slice<F>(
+ item_columns: &[ArrayRef],
+ target: &[ScalarValue],
+ compare_fn: F,
+ mut low: usize,
+ high: usize,
+) -> Result<usize>
+where
+ F: Fn(&[ScalarValue], &[ScalarValue]) -> Result<bool>,
+{
+ while low < high {
Review Comment:
I don't know if this would be faster or not, but I winder if you might be
able to use `.iter()` and `find()` to find the relevant index and possibly
avoid the bounds checks 🤔
Like if you could do something like
`item_columns.iter().enumerate().find(compare_fn).map(|(i, _)| i).unwrap()`
##########
datafusion/common/src/utils.rs:
##########
@@ -22,8 +22,16 @@ use arrow::array::ArrayRef;
use arrow::compute::SortOptions;
use std::cmp::Ordering;
+/// Given column vectors, returns row at `idx`.
+pub fn get_row_at_idx(columns: &[ArrayRef], idx: usize) ->
Result<Vec<ScalarValue>> {
+ columns
+ .iter()
+ .map(|arr| ScalarValue::try_from_array(arr, idx))
Review Comment:
Longer term, I think there is quite a bit more performance to be gained by
avoiding the use of `ScalarValue` here and instead using something more
optimized like the RowFormat. This is not something for this PR I am just
trying to plan a seed if we need more performance, there is a path
##########
datafusion/common/src/utils.rs:
##########
@@ -103,6 +111,53 @@ where
Ok(low)
}
+/// This function searches for a tuple of given values (`target`) among the
given
+/// rows (`item_columns`) via a linear scan. It assumes that `item_columns` is
sorted
+/// according to `sort_options` and returns the insertion index of `target`.
+/// Template argument `SIDE` being `true`/`false` means left/right insertion.
+pub fn linear_search<const SIDE: bool>(
Review Comment:
I didn't see `linear_search` used anywhere in this PR other than tests -- I
wonder if the intention was to call it in
`datafusion/physical-expr/src/window/window_frame_state.rs` ? It looks like
that currently has an inlined version of this function 🤔
##########
datafusion/physical-expr/src/window/partition_evaluator.rs:
##########
@@ -83,7 +83,7 @@ pub trait PartitionEvaluator: Debug + Send {
fn evaluate_inside_range(
&self,
_values: &[ArrayRef],
- _range: Range<usize>,
+ _range: &Range<usize>,
Review Comment:
👍
##########
datafusion/physical-expr/src/window/window_frame_state.rs:
##########
@@ -285,8 +301,16 @@ impl WindowFrameStateRange {
} else {
current_row_values
};
- // `BISECT_SIDE` true means bisect_left, false means bisect_right
- bisect::<BISECT_SIDE>(range_columns, &end_range, sort_options)
+ let search_start = if SIDE {
+ last_range.start
+ } else {
+ last_range.end
+ };
+ let compare_fn = |current: &[ScalarValue], target: &[ScalarValue]| {
Review Comment:
See the above comment about
https://docs.rs/arrow/31.0.0/arrow/compute/struct.LexicographicalComparator.html
possibly being another way to improve performance
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]