korowa commented on code in PR #14232:
URL: https://github.com/apache/datafusion/pull/14232#discussion_r1929071971


##########
datafusion/functions-aggregate/src/first_last.rs:
##########
@@ -701,9 +713,98 @@ fn convert_to_sort_cols(arrs: &[ArrayRef], sort_exprs: 
&LexOrdering) -> Vec<Sort
 #[cfg(test)]
 mod tests {
     use arrow::array::Int64Array;
+    use arrow_schema::Schema;
+    use compute::SortOptions;
+    use datafusion_physical_expr::{expressions::col, PhysicalSortExpr};
 
     use super::*;
 
+    #[test]
+    fn test_last_value_with_order_bys() -> Result<()> {
+        // TODO: Move this kind of test to slt, we don't have a nice way to 
define the batch size for each `update_batch`

Review Comment:
   Reading data from source file (not sure about memory tables) into 
aggregation with `batch_size = N` should help. Is this not working?



##########
datafusion/functions-aggregate/src/first_last.rs:
##########
@@ -569,6 +573,13 @@ impl LastValueAccumulator {
             })
             .collect::<Vec<_>>();
 
+        // Order by indices for cases where the values are the same, we expect 
the last index
+        let indices: UInt64Array = (0..num_rows).map(|x| x as u64).collect();
+        sort_columns.push(SortColumn {

Review Comment:
   This part is a bit concerning in terms of performance -- won't sorting by 
multiple columns (due to row indices being added, now there are no cases with a 
single column) be noticeably slower due to falling back to lexicographical 
comparator?
   
   If so, perhaps, it should either be worked around (to somehow sort leaving 
last `is_ge()` value) or maybe it doesn't worth that, since this will work only 
for single threaded queries, while, I suppose, the primarily use-case is 
multithreaded execution, which messes up the original order of records in files 
due to parallel reads followed by repartitioning.
   
   \+ Not sure if current behaviour of this function is incorrect and must be 
fixed, it looks more like nice-to-have feature (but that is just an opinion).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to