[GitHub] [arrow-datafusion] jaylmiller commented on issue #5230: Use Arrow Row Format in SortExec


jaylmiller commented on issue #5230:
URL: https://github.com/apache/arrow-datafusion/issues/5230#issuecomment-1454843441

I ran some experiments investigating how batch size impacts performance when
doing multi-column sorts on a single record batch.

So the batch size theory seems wrong, but these results do demonstrate why
the "preserve partitioning" cases are regressing. What's interesting is that
while single-batch sorting performance for the row format is actually worse,
we're still getting a significant performance increase when more than one
batch is being sorted 🤔. For example, the benchmark comparison for utf8-tuple:
```
group                                    main-sort                          rows-sort
-----                                    ---------                          ---------
sort utf8 tuple                          1.79  62.5±13.08ms  ? ?/sec        1.00  35.0±1.62ms  ? ?/sec
sort utf8 tuple preserve partitioning    1.00   7.3±0.79ms   ? ?/sec        1.17   8.6±0.74ms  ? ?/sec
```

Methodology: https://github.com/jaylmiller/inspect-arrow-sort. The actual
sorting is [right
here](https://github.com/jaylmiller/inspect-arrow-sort/blob/main/src/lib.rs#L23-L75)
and is pretty much entirely lifted from the PR.
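
For anyone less familiar with the two approaches being compared, here is a
minimal, std-only sketch of the general idea. It is not the code from the PR
or the Arrow row format itself: the function names are hypothetical, and the
encoding is a simplified stand-in that assumes the strings contain no NUL
bytes. It sorts a (utf8, u32) tuple once with a per-column comparator and
once by encoding each row into bytes that can be compared directly.

```rust
use std::cmp::Ordering;

/// Comparator-based multi-column sort (hypothetical helper):
/// returns row indices ordered by (strs, nums), comparing column by column.
fn sort_indices_by_comparator(strs: &[&str], nums: &[u32]) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..strs.len()).collect();
    idx.sort_by(|&a, &b| match strs[a].cmp(strs[b]) {
        Ordering::Equal => nums[a].cmp(&nums[b]),
        other => other,
    });
    idx
}

/// Encode one row so that bytewise comparison matches tuple comparison.
/// Simplified encoding (an assumption, not Arrow's): the string bytes,
/// a 0x00 terminator, then the integer in big-endian order.
fn encode_row(s: &str, n: u32) -> Vec<u8> {
    let mut row = Vec::with_capacity(s.len() + 1 + 4);
    row.extend_from_slice(s.as_bytes());
    row.push(0x00); // sorts below any byte of a NUL-free string
    row.extend_from_slice(&n.to_be_bytes());
    row
}

/// Row-based multi-column sort (hypothetical helper):
/// encode every row up front, then sort indices by comparing the bytes.
fn sort_indices_by_rows(strs: &[&str], nums: &[u32]) -> Vec<usize> {
    let rows: Vec<Vec<u8>> = strs
        .iter()
        .zip(nums)
        .map(|(s, n)| encode_row(s, *n))
        .collect();
    let mut idx: Vec<usize> = (0..rows.len()).collect();
    idx.sort_by(|&a, &b| rows[a].cmp(&rows[b]));
    idx
}
```

Both functions return the same ordering for the same input. The row-based
version does extra encoding work up front in exchange for a single byte
comparison per pair of rows.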
