korowa commented on code in PR #14232: URL: https://github.com/apache/datafusion/pull/14232#discussion_r1929071971
########## datafusion/functions-aggregate/src/first_last.rs: ########## @@ -701,9 +713,98 @@ fn convert_to_sort_cols(arrs: &[ArrayRef], sort_exprs: &LexOrdering) -> Vec<Sort #[cfg(test)] mod tests { use arrow::array::Int64Array; + use arrow_schema::Schema; + use compute::SortOptions; + use datafusion_physical_expr::{expressions::col, PhysicalSortExpr}; use super::*; + #[test] + fn test_last_value_with_order_bys() -> Result<()> { + // TODO: Move this kind of test to slt, we don't have a nice way to define the batch size for each `update_batch` Review Comment: Reading data from source file (not sure about memory tables) into aggregation with `batch_size = N` should help. Is this not working? ########## datafusion/functions-aggregate/src/first_last.rs: ########## @@ -569,6 +573,13 @@ impl LastValueAccumulator { }) .collect::<Vec<_>>(); + // Order by indices for cases where the values are the same, we expect the last index + let indices: UInt64Array = (0..num_rows).map(|x| x as u64).collect(); + sort_columns.push(SortColumn { Review Comment: This part is a bit concerning in terms of performance -- won't sorting by multiple columns (due to row indices being added, now there are no cases with a single column) be noticeably slower due to falling back to lexicographical comparator? If so, perhaps, it should either be worked around (to somehow sort leaving last `is_ge()` value) or maybe it doesn't worth that, since this will work only for single threaded queries, while, I suppose, the primarily use-case is multithreaded execution, which messes up the original order of records in files due to parallel reads followed by repartitioning. \+ Not sure if current behaviour of this function is incorrect and must be fixed, it looks more like nice-to-have feature (but that is just an opinion). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org