Steven Phillips created DRILL-386:
-------------------------------------
Summary: Implement External Sort operator
Key: DRILL-386
URL: https://issues.apache.org/jira/browse/DRILL-386
Project: Apache Drill
Issue Type: New Feature
Reporter: Steven Phillips
Assignee: Steven Phillips
This operator will allow sorting data that is larger than the size of memory by
spilling to disk when necessary.
Also as part of this jira, I will implement a new merge sort algorithm that
will hopefully better utilize cluster resources than our current sort, which is
based on Quicksort. The problem with quicksort is that we can't do any sorting
until all of the batches have arrived. But this will often result in very low
CPU utilization while the data is read from disk, followed by a period of very
high CPU usage.
The external sort will include sorting each batch individually when it arrives.
In the case where no spills occur, it makes sense to take advantage of the fact
that the batches are already sorted, but doing the an N-way merge, as is done
when there are spills, is not the most effective way to do this, (according to
some tests I have done). What will be done in this case, rather than an N-way
merge using a heap, we will do a variation of natural merge sort, which amounts
to essentially a staged, 2-way merge of the incoming (sorted) batches.
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)