alamb commented on issue #1708: URL: https://github.com/apache/arrow-datafusion/issues/1708#issuecomment-1028906123
💯 with what @Dandandan and @houqp said; thank you for writing this up @yjshen ❤️

> I am wondering if for certain operations, e.g. hash aggregate, I feel fixed size input the data is stored better in a columnar format (mutable array, with offsets),

I agree with @Dandandan that for HashAggregate this would be super helpful, as the group keys and aggregates could be computed "in place" (so output was free).

Sorting is indeed different because the sort key is different from what appears in the output. For example, `SELECT a, b, c ... ORDER BY a+b` needs to compare on `a+b` but still produce tuples of `(a, b, c)`.

For grouping, by contrast, the grouping values themselves are produced. For example, `SELECT a+b, sum(c) ... GROUP BY a+b` produces tuples of `(a+b, sum)`.

p.s. for what it is worth, I think DuckDB has a short string optimization, so the key may look something more like

```text
Table A (bool a, char b, int c, string d)
row_value (true, 'W', 59, "XYZ")
┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
│ 0F │ 1  │ W  │ 00 │ 00 │ 00 │ 3B │ 03 │ 00 │ 00 │ 00 │ 00 │ X  │ Y  │ Z  │
└────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘
                                     8

Table A (bool a, char b, int c, string d)
row_value (true, 'W', 59, "XYZXYZXYZ")
┌────┬────┬────┬────┬────┬────┬────┬─────────────────────────────────────────────┐
│ 0F │ 1  │ W  │ 00 │ 00 │ 00 │ 3B │                     PTR                     │
└────┴────┴────┴────┴────┴────┴────┴─────────────────────────────────────────────┘
                                                          │ 8
                                                          └───┐
                                                              ▼
                                                         "XYZXYZXYZ"
```
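To make the two layouts in the diagrams above concrete, here is a minimal Rust sketch of a fixed-width string slot that stores short strings inline and spills longer ones to a separate heap buffer. Everything in it is an assumption for illustration: the 8-byte slot width is read off the "8" in the diagrams, and the `HEAP_TAG` marker, the `write_str` / `read_str` helpers, and the exact byte order are hypothetical. It is neither DuckDB's actual encoding nor a proposed DataFusion API.

```rust
/// Width in bytes of the string slot inside the fixed-size row
/// (assumed from the "8" in the diagrams above).
const SLOT: usize = 8;
/// Longest string stored inline: one length byte plus up to 7 payload bytes.
const MAX_INLINE: usize = SLOT - 1;
/// Hypothetical marker in the first slot byte: "slot holds a heap offset".
const HEAP_TAG: u8 = 0xFF;

/// Append one string field to `row`; strings longer than
/// `MAX_INLINE` bytes spill into `heap`.
pub fn write_str(row: &mut Vec<u8>, heap: &mut Vec<u8>, s: &str) {
    let bytes = s.as_bytes();
    let mut slot = [0u8; SLOT];
    if bytes.len() <= MAX_INLINE {
        // Inline layout: [len][payload, zero padded].
        slot[0] = bytes.len() as u8;
        slot[1..1 + bytes.len()].copy_from_slice(bytes);
    } else {
        // Spilled layout: [HEAP_TAG][u32 heap offset][unused].
        // The heap entry is a u32 length followed by the string bytes.
        let offset = heap.len() as u32;
        heap.extend_from_slice(&(bytes.len() as u32).to_le_bytes());
        heap.extend_from_slice(bytes);
        slot[0] = HEAP_TAG;
        slot[1..5].copy_from_slice(&offset.to_le_bytes());
    }
    row.extend_from_slice(&slot);
}

/// Read a string field back from its 8-byte slot and the heap.
pub fn read_str<'a>(slot: &'a [u8], heap: &'a [u8]) -> &'a str {
    let bytes = if slot[0] == HEAP_TAG {
        let start = read_u32(&slot[1..5]) as usize;
        let len = read_u32(&heap[start..start + 4]) as usize;
        &heap[start + 4..start + 4 + len]
    } else {
        &slot[1..1 + slot[0] as usize]
    };
    std::str::from_utf8(bytes).expect("slot was written from a valid &str")
}

/// Decode a little-endian u32 from exactly four bytes.
fn read_u32(b: &[u8]) -> u32 {
    let mut buf = [0u8; 4];
    buf.copy_from_slice(b);
    u32::from_le_bytes(buf)
}
```

Because every slot is the same width regardless of string length, the fixed part of the row keeps a constant size, and only long values pay for the extra indirection through the heap.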