jaylmiller commented on issue #5230: URL: https://github.com/apache/arrow-datafusion/issues/5230#issuecomment-1437608442
@ozankabak I think this could be what's going on: in the `preserve partitioning` cases, each partition receives a single batch and sorts are done per partition, so only a single batch of data is sorted at a time. In the non-`preserve partitioning` case, every batch of data is sorted together. The time cost of the row encoding should scale linearly with the number of rows (`O(n)`), while the time cost of sorting should be `O(n*log(n))`. So for smaller amounts of data, the upfront cost of encoding the rows is greater than the time saved by having a more efficient comparison during sorting. But as the number of rows increases, the cost of sorting grows faster than the cost of encoding, making the faster comparisons more beneficial.

That being said, I'm not totally sure how to approach this issue code-wise. Suggestions would be appreciated 😅
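To illustrate the crossover argument above, here is a minimal cost-model sketch. The constants (`ENCODE_COST`, `CMP_ENCODED`, `CMP_RAW`) are entirely hypothetical, not measurements from DataFusion; the point is only that a linear encoding term plus cheap `n*log2(n)` comparisons loses to expensive comparisons at small `n` but wins at large `n`:

```python
import math

# Hypothetical per-row / per-comparison costs (illustrative only):
ENCODE_COST = 5.0   # cost to row-encode one row up front
CMP_ENCODED = 1.0   # cost of one comparison on encoded rows (cheap memcmp-style)
CMP_RAW = 3.0       # cost of one comparison on the original columnar data

def cost_with_encoding(n: int) -> float:
    # O(n) encoding up front, then ~n*log2(n) cheap comparisons while sorting
    return ENCODE_COST * n + CMP_ENCODED * n * math.log2(n)

def cost_without_encoding(n: int) -> float:
    # No encoding step, but every comparison during the sort is more expensive
    return CMP_RAW * n * math.log2(n)

small, large = 4, 1 << 20
# Small input: encoding overhead dominates, skipping it is cheaper
print(cost_with_encoding(small) > cost_without_encoding(small))   # True
# Large input: the n*log2(n) comparison savings outgrow the O(n) encoding cost
print(cost_with_encoding(large) < cost_without_encoding(large))   # True
```

Under this model the break-even row count depends only on the cost ratios, which suggests one possible direction: pick between the two code paths based on a row-count threshold per sort.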
