ozankabak commented on issue #5230: URL: https://github.com/apache/arrow-datafusion/issues/5230#issuecomment-1454917239
During my cursory look at the comments, the wording "sort all buffered batches" made me think we were sorting some sort of coalesced dataset if there is no memory issue. So the comment is somewhat misleading (at least one person got it wrong!).

Looking at the code more attentively, I see that it is doing what you are describing; i.e. buffering partially sorted batches. Given this, I am currently out of theories as to why we see the regression in the `preserve_partitioning` cases; i.e.

> So we have `sort mixed tuple preserve partitioning`, `sort mixed utf8 dictionary tuple preserve partitioning`, `sort utf8 dictionary tuple preserve partitioning` and `sort utf8 tuple preserve partitioning` remaining as cases with regression.

Maybe we will get more ideas when @tustvold takes a look at whether row conversion is done properly. I will keep thinking about this in parallel as well, and will share here if I can think of anything.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
