jaylmiller commented on issue #5230: URL: https://github.com/apache/arrow-datafusion/issues/5230#issuecomment-1437608442
@ozankabak I think this could be what's going on: in the `preserve partitioning` cases, each partition receives a single batch and sorts are done per partition, so only a single batch of data is sorted at a time. In the non-`preserve partitioning` case, every batch of data is sorted together. The time cost of the row encoding should scale linearly with the number of rows (`O(n)`), while the time cost of sorting should be `O(n*log(n))`. So for smaller amounts of data, the upfront cost of encoding the rows is greater than the time saved by having a more efficient comparison during sorting. But as the number of rows increases, the cost of sorting grows faster than the cost of encoding, making the faster comparisons more beneficial.

That being said, I'm not totally sure how to approach this issue code-wise. Suggestions would be appreciated 😅
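To illustrate the crossover argument above, here is a minimal cost-model sketch. The constants (`ENCODE_COST`, `CMP_ENCODED`, `CMP_RAW`) are entirely hypothetical, not measurements from DataFusion; the point is only that a linear encoding term plus cheap `n*log2(n)` comparisons loses to expensive comparisons at small `n` but wins at large `n`:

```python
import math

# Hypothetical per-row / per-comparison costs (illustrative only):
ENCODE_COST = 5.0   # cost to row-encode one row up front
CMP_ENCODED = 1.0   # cost of one comparison on encoded rows (cheap memcmp-style)
CMP_RAW = 3.0       # cost of one comparison on the original columnar data

def cost_with_encoding(n: int) -> float:
    # O(n) encoding up front, then ~n*log2(n) cheap comparisons while sorting
    return ENCODE_COST * n + CMP_ENCODED * n * math.log2(n)

def cost_without_encoding(n: int) -> float:
    # No encoding step, but every comparison during the sort is more expensive
    return CMP_RAW * n * math.log2(n)

small, large = 4, 1 << 20
# Small input: encoding overhead dominates, skipping it is cheaper
print(cost_with_encoding(small) > cost_without_encoding(small))   # True
# Large input: the n*log2(n) comparison savings outgrow the O(n) encoding cost
print(cost_with_encoding(large) < cost_without_encoding(large))   # True
```

Under this model the break-even row count depends only on the cost ratios, which suggests one possible direction: pick between the two code paths based on a row-count threshold per sort.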
