[GitHub] [arrow-datafusion] ozankabak commented on issue #5230: Use Arrow Row Format in SortExec

via GitHub Sat, 04 Mar 2023 15:11:14 -0800


ozankabak commented on issue #5230:
URL: 
https://github.com/apache/arrow-datafusion/issues/5230#issuecomment-1454917239


   During my cursory look at the comments, the wording "sort all buffered 
batches" made me think we were sorting a single coalesced dataset whenever 
memory is not a constraint. So the comment is somewhat misleading (at least one 
person got it wrong!).
   
   Looking at the code more carefully, I see that it does what you describe, 
i.e. it buffers partially sorted batches. Given this, I am currently out of 
theories as to why we see the regression in the `preserve_partitioning` cases, 
i.e.:
   
   > So we have `sort mixed tuple preserve partitioning`, `sort mixed utf8 
dictionary tuple preserve partitioning`, `sort utf8 dictionary tuple preserve 
partitioning` and `sort utf8 tuple preserve partitioning` remaining as cases 
with regression.
   
   Maybe we will get more ideas when @tustvold takes a look at whether row 
conversion is done properly. I will keep thinking about this in parallel as 
well. I will share here if I can think of anything.
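
   To make the two readings of "sort all buffered batches" concrete, here is a 
toy sketch that uses plain `Vec<i32>` values in place of record batches. The 
function names are hypothetical and are not DataFusion APIs; this only 
illustrates the distinction and is not how `SortExec` is actually written.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Reading 1 (my initial misreading): concatenate every buffered batch
/// into one coalesced dataset and sort it once.
fn sort_coalesced(batches: &[Vec<i32>]) -> Vec<i32> {
    let mut all: Vec<i32> = batches.iter().flatten().copied().collect();
    all.sort();
    all
}

/// Reading 2: sort each batch on its own, buffer the sorted batches,
/// then combine them with a k-way merge.
fn sort_then_merge(batches: &[Vec<i32>]) -> Vec<i32> {
    // Sort every batch individually; the buffer as a whole is only
    // partially sorted at this point.
    let sorted: Vec<Vec<i32>> = batches
        .iter()
        .map(|b| {
            let mut b = b.clone();
            b.sort();
            b
        })
        .collect();

    // Min-heap keyed on (value, batch index, position within batch).
    let mut heap = BinaryHeap::new();
    for (i, b) in sorted.iter().enumerate() {
        if let Some(&v) = b.first() {
            heap.push(Reverse((v, i, 0usize)));
        }
    }

    let mut out = Vec::with_capacity(sorted.iter().map(Vec::len).sum());
    while let Some(Reverse((v, i, pos))) = heap.pop() {
        out.push(v);
        // Advance the cursor of the batch we just consumed from.
        if let Some(&next) = sorted[i].get(pos + 1) {
            heap.push(Reverse((next, i, pos + 1)));
        }
    }
    out
}
```

   Both functions return the same ordering; they differ only in where the 
sorting work happens, which is the part the wording left ambiguous to me.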


