jaylmiller commented on issue #5230: URL: https://github.com/apache/arrow-datafusion/issues/5230#issuecomment-1454843441
I ran some experiments investigating how batch size impacts performance when doing multi-column sorts on a single record batch.

<img src="https://github.com/jaylmiller/inspect-arrow-sort/raw/main/img/mixed-tuple.png" >
<img src="https://github.com/jaylmiller/inspect-arrow-sort/raw/main/img/utf8-tuple.png" >
<img src="https://github.com/jaylmiller/inspect-arrow-sort/raw/main/img/dictionary-tuple.png">
<img src="https://github.com/jaylmiller/inspect-arrow-sort/raw/main/img/mixed-dictionary-tuple.png">

So the batch size theory seems wrong, but these results do demonstrate why the "preserve partitioning" cases are regressing.

What's interesting is that while single-batch sorting performance for the row format is actually worse, we're still getting a significant performance increase when more than one batch is being sorted 🤔. For example, the benchmark comparisons for utf8-tuple:

```
group                                    main-sort                       rows-sort
-----                                    ---------                       ---------
sort utf8 tuple                          1.79     62.5±13.08ms ? ?/sec   1.00     35.0±1.62ms ? ?/sec
sort utf8 tuple preserve partitioning    1.00      7.3±0.79ms ? ?/sec    1.17      8.6±0.74ms ? ?/sec
```

Methodology: https://github.com/jaylmiller/inspect-arrow-sort. The actual sorting is [right here](https://github.com/jaylmiller/inspect-arrow-sort/blob/main/src/lib.rs#L23-L75) and pretty much entirely lifted from the PR.
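For readers unfamiliar with the two strategies being compared, here is a rough std-only sketch of the difference. It is not the code from the PR or from the linked repo, and it does not use the arrow crate. The function names and the key encoding (big-endian integer followed by NUL-terminated string bytes) are assumptions made for illustration, and the encoding is much simpler than Arrow's actual row format. The column-wise version compares one column at a time inside the comparator. The row-wise version first encodes every row into a byte key and then compares keys as plain byte strings.

```rust
/// Column-wise strategy: sort row indices with a comparator that walks
/// the columns in order and only looks at the next column on a tie.
fn sort_indices_columnar(ints: &[u32], strs: &[&str]) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..ints.len()).collect();
    idx.sort_by(|&i, &j| {
        ints[i]
            .cmp(&ints[j])
            .then_with(|| strs[i].cmp(strs[j]))
    });
    idx
}

/// Row-wise strategy, step 1: encode each row as a byte key whose
/// bytewise order matches the tuple order.
/// Assumption for this sketch: strings contain no interior NUL bytes,
/// so a single 0x00 terminator is enough to keep prefixes ordered first.
fn encode_rows(ints: &[u32], strs: &[&str]) -> Vec<Vec<u8>> {
    ints.iter()
        .zip(strs)
        .map(|(n, s)| {
            let mut key = Vec::with_capacity(4 + s.len() + 1);
            // Big-endian bytes compare in the same order as the integer.
            key.extend_from_slice(&n.to_be_bytes());
            key.extend_from_slice(s.as_bytes());
            key.push(0);
            key
        })
        .collect()
}

/// Row-wise strategy, step 2: sort row indices by comparing the
/// pre-encoded keys as plain byte slices.
fn sort_indices_rows(ints: &[u32], strs: &[&str]) -> Vec<usize> {
    let rows = encode_rows(ints, strs);
    let mut idx: Vec<usize> = (0..rows.len()).collect();
    idx.sort_by(|&i, &j| rows[i].cmp(&rows[j]));
    idx
}
```

Both functions return the same permutation for the same input. The difference is where the work happens: the row-wise version pays an encoding pass up front, after which every comparison is a single byte-slice compare.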
