[GitHub] [arrow-datafusion] ozankabak commented on a diff in pull request #4989: Add support for linear range calculation in WINDOW functions

2023-01-20 Thread via GitHub


ozankabak commented on code in PR #4989:
URL: https://github.com/apache/arrow-datafusion/pull/4989#discussion_r1083058356


##
datafusion/common/src/utils.rs:
##
@@ -103,6 +111,53 @@ where
 Ok(low)
 }
 
+/// This function searches for a tuple of given values (`target`) among the given
+/// rows (`item_columns`) via a linear scan. It assumes that `item_columns` is sorted
+/// according to `sort_options` and returns the insertion index of `target`.
+/// Template argument `SIDE` being `true`/`false` means left/right insertion.
+pub fn linear_search<const SIDE: bool>(
+    item_columns: &[ArrayRef],
+    target: &[ScalarValue],
+    sort_options: &[SortOptions],
+) -> Result<usize> {
+    let low: usize = 0;
+    let high: usize = item_columns
+        .get(0)
+        .ok_or_else(|| {
+            DataFusionError::Internal("Column array shouldn't be empty".to_string())
+        })?
+        .len();
+    let compare_fn = |current: &[ScalarValue], target: &[ScalarValue]| {
+        let cmp = compare_rows(current, target, sort_options)?;
+        Ok(if SIDE { cmp.is_lt() } else { cmp.is_le() })
+    };
+    search_in_slice(item_columns, target, compare_fn, low, high)
+}
+
+/// This function searches for a tuple of given values (`target`) among a slice of
+/// the given rows (`item_columns`) via a linear scan. The slice starts at the index
+/// `low` and ends at the index `high`. The boolean-valued function `compare_fn`
+/// specifies the stopping criterion.
+pub fn search_in_slice<F>(
+    item_columns: &[ArrayRef],
+    target: &[ScalarValue],
+    compare_fn: F,
+    mut low: usize,
+    high: usize,
+) -> Result<usize>
+where
+    F: Fn(&[ScalarValue], &[ScalarValue]) -> Result<bool>,
+{
+    while low < high {

Review Comment:
   Yes, exactly. Let's still keep this in mind, though, and improve this section in the future if anyone finds a neat way to do it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-datafusion] ozankabak commented on a diff in pull request #4989: Add support for linear range calculation in WINDOW functions

2023-01-20 Thread GitBox


ozankabak commented on code in PR #4989:
URL: https://github.com/apache/arrow-datafusion/pull/4989#discussion_r1082791126


##
datafusion/common/src/utils.rs:
##
@@ -103,6 +111,53 @@ where
 Ok(low)
 }
 
+/// This function searches for a tuple of given values (`target`) among the given
+/// rows (`item_columns`) via a linear scan. It assumes that `item_columns` is sorted
+/// according to `sort_options` and returns the insertion index of `target`.
+/// Template argument `SIDE` being `true`/`false` means left/right insertion.
+pub fn linear_search<const SIDE: bool>(
+    item_columns: &[ArrayRef],
+    target: &[ScalarValue],
+    sort_options: &[SortOptions],
+) -> Result<usize> {
+    let low: usize = 0;
+    let high: usize = item_columns
+        .get(0)
+        .ok_or_else(|| {
+            DataFusionError::Internal("Column array shouldn't be empty".to_string())
+        })?
+        .len();
+    let compare_fn = |current: &[ScalarValue], target: &[ScalarValue]| {
+        let cmp = compare_rows(current, target, sort_options)?;
+        Ok(if SIDE { cmp.is_lt() } else { cmp.is_le() })
+    };
+    search_in_slice(item_columns, target, compare_fn, low, high)
+}
+
+/// This function searches for a tuple of given values (`target`) among a slice of
+/// the given rows (`item_columns`) via a linear scan. The slice starts at the index
+/// `low` and ends at the index `high`. The boolean-valued function `compare_fn`
+/// specifies the stopping criterion.
+pub fn search_in_slice<F>(
+    item_columns: &[ArrayRef],
+    target: &[ScalarValue],
+    compare_fn: F,
+    mut low: usize,
+    high: usize,
+) -> Result<usize>
+where
+    F: Fn(&[ScalarValue], &[ScalarValue]) -> Result<bool>,
+{
+    while low < high {

Review Comment:
   I think you mean something like this:
   ```rust
   Ok((low..high).find(|&idx| {
   let val = get_row_at_idx(item_columns, idx)?;
   !compare_fn(&val, target)?
   }).unwrap_or(high))
   ```
   
   The problem is the `?` operators: a closure passed to `find` must return `bool`, so we would need to change them to `unwrap` calls for this to work. The code would look nicer, but we would incur the downside of panicking if something goes wrong. In general, I prefer to err on the side of being a little more verbose than necessary and retain control over errors, but I don't have a strong opinion on this specific case. What do you think?
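
For illustration only, here is a minimal sketch of one way to keep an iterator-style scan without giving up error propagation. It is not code from the PR: it swaps in `i64` items and a `String` error type for `ScalarValue` rows and `DataFusionError`, and the names `get_item` and `search_iter` are hypothetical. The idea is to use `find_map`, so that both a failed comparison and an error end the scan, with the error carried out as a value instead of a panic.

```rust
// Hypothetical stand-in for the crate's `Result` alias (assumption: String errors).
type Result<T> = std::result::Result<T, String>;

// Hypothetical stand-in for `get_row_at_idx`: a fallible lookup of one item.
fn get_item(items: &[i64], idx: usize) -> Result<i64> {
    items
        .get(idx)
        .copied()
        .ok_or_else(|| format!("index {idx} is out of bounds"))
}

// Scans `low..high` and returns the first index at which `compare_fn`
// returns `false`, or `high` if it never does. Any error from the lookup
// or from `compare_fn` stops the scan and is returned to the caller.
fn search_iter<F>(
    items: &[i64],
    target: i64,
    compare_fn: F,
    low: usize,
    high: usize,
) -> Result<usize>
where
    F: Fn(i64, i64) -> Result<bool>,
{
    (low..high)
        .find_map(|idx| {
            let step = get_item(items, idx).and_then(|val| compare_fn(val, target));
            match step {
                Ok(true) => None,           // keep scanning
                Ok(false) => Some(Ok(idx)), // stopping criterion met
                Err(e) => Some(Err(e)),     // surface the error, no panic
            }
        })
        .unwrap_or(Ok(high))
}
```

Whether this reads better than the explicit `while` loop is a matter of taste; the `match` adds about as much noise as the loop removes.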






[GitHub] [arrow-datafusion] ozankabak commented on a diff in pull request #4989: Add support for linear range calculation in WINDOW functions

2023-01-20 Thread GitBox


ozankabak commented on code in PR #4989:
URL: https://github.com/apache/arrow-datafusion/pull/4989#discussion_r1082640117


##
datafusion/common/src/utils.rs:
##
@@ -103,6 +111,53 @@ where
 Ok(low)
 }
 
+/// This function searches for a tuple of given values (`target`) among the given
+/// rows (`item_columns`) via a linear scan. It assumes that `item_columns` is sorted
+/// according to `sort_options` and returns the insertion index of `target`.
+/// Template argument `SIDE` being `true`/`false` means left/right insertion.
+pub fn linear_search<const SIDE: bool>(

Review Comment:
   Yes, so the same logic has two drivers: one that takes a comparison function, and one that takes `SortOptions`. We currently use the former, but we also expect to use the latter in the near future (we plan a follow-up to this PR for GROUPS mode). As @mustafasrepo mentions, it also brings both search APIs in line, which is good too.
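
To make the "two drivers" structure concrete, here is a simplified, self-contained sketch. It is an approximation and not the PR's code: rows are plain `i64` values, a hypothetical `descending` flag stands in for `SortOptions`, and errors are `String`s. The `SortOptions`-style driver only builds a comparison closure and delegates to the closure-based driver, which owns the scan.

```rust
// Hypothetical stand-in for the crate's `Result` alias (assumption: String errors).
type Result<T> = std::result::Result<T, String>;

// Driver 1: the caller supplies the stopping criterion. The scan advances
// `low` for as long as `compare_fn` returns `true`.
fn search_in_slice<F>(
    items: &[i64],
    target: i64,
    compare_fn: F,
    mut low: usize,
    high: usize,
) -> Result<usize>
where
    F: Fn(i64, i64) -> Result<bool>,
{
    while low < high {
        let val = *items
            .get(low)
            .ok_or_else(|| format!("index {low} is out of bounds"))?;
        if !compare_fn(val, target)? {
            break;
        }
        low += 1;
    }
    Ok(low)
}

// Driver 2: the stopping criterion is derived from the sort order.
// `SIDE == true` gives the left insertion index, `false` the right one.
// `descending` is an illustrative substitute for `SortOptions`.
fn linear_search<const SIDE: bool>(
    items: &[i64],
    target: i64,
    descending: bool,
) -> Result<usize> {
    let compare_fn = |current: i64, target: i64| -> Result<bool> {
        let cmp = if descending {
            target.cmp(&current)
        } else {
            current.cmp(&target)
        };
        Ok(if SIDE { cmp.is_lt() } else { cmp.is_le() })
    };
    search_in_slice(items, target, compare_fn, 0, items.len())
}
```

With this split, a caller that already has a custom comparison uses `search_in_slice` directly, while a caller that only knows the sort order uses `linear_search::<true>` or `linear_search::<false>`.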


