[
https://issues.apache.org/jira/browse/DRILL-386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Steven Phillips updated DRILL-386:
----------------------------------
Attachment: DRILL-386.patch
Fixed a memory leak
> Implement External Sort operator
> --------------------------------
>
> Key: DRILL-386
> URL: https://issues.apache.org/jira/browse/DRILL-386
> Project: Apache Drill
> Issue Type: New Feature
> Reporter: Steven Phillips
> Assignee: Steven Phillips
> Attachments: DRILL-386.patch, DRILL-386.patch, DRILL-386.patch,
> DRILL-386.patch
>
>
> This operator will allow sorting data that is larger than the size of memory
> by spilling to disk when necessary.
> Also as part of this jira, I will implement a new merge sort algorithm that
> will hopefully better utilize cluster resources than our current sort, which
> is based on Quicksort. The problem with quicksort is that we can't do any
> sorting until all of the batches have arrived. But this will often result in
> very low CPU utilization while the data is read from disk, followed by a
> period of very high CPU usage.
> The external sort will include sorting each batch individually when it
> arrives. In the case where no spills occur, it makes sense to take advantage
> of the fact that the batches are already sorted, but doing the an N-way
> merge, as is done when there are spills, is not the most effective way to do
> this, (according to some tests I have done). What will be done in this case,
> rather than an N-way merge using a heap, we will do a variation of natural
> merge sort, which amounts to essentially a staged, 2-way merge of the
> incoming (sorted) batches.
--
This message was sent by Atlassian JIRA
(v6.2#6252)