Here is a summary of some discussions on sort algorithms (the implementation work will be part of HADOOP-331). I did some benchmarking of the sort algorithms with respect to our needs; attached is a table containing the results. I benchmarked Hadoop's MergeSort, a QuickSort (a rather primitive one, without many optimizations, from http://svn.apache.org/repos/asf/jakarta/turbine/core/trunk/src/java/org/apache/turbine/util/QuickSort.java), and java.util.Arrays.sort. I have also attached the source code for the benchmark programs. The key/value pairs are IntWritables.

The input to any sort is an array of offsets to the beginnings of the key/value pairs in a buffer (containing 'n' key/value pairs). The output of any sort is a sorted array of offsets such that key_at_buffer[output[i]] < key_at_buffer[output[i+1]].
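As a concrete illustration of that contract, here is a minimal sketch (class and method names are mine, not from the attached code) that sorts an offset array by comparing the 4-byte big-endian keys it points at, without ever moving the records themselves:

```java
import java.util.Arrays;
import java.util.Comparator;

public class OffsetSortSketch {

    // Read a 4-byte big-endian int key starting at 'off' (IntWritable layout).
    static int keyAt(byte[] buf, int off) {
        return ((buf[off] & 0xff) << 24) | ((buf[off + 1] & 0xff) << 16)
             | ((buf[off + 2] & 0xff) << 8) | (buf[off + 3] & 0xff);
    }

    // Indirect sort: rearranges 'offsets' only; the buffer is untouched.
    static void sortOffsets(byte[] buf, int[] offsets) {
        Integer[] boxed = new Integer[offsets.length];
        for (int i = 0; i < offsets.length; i++) boxed[i] = offsets[i];
        Arrays.sort(boxed, Comparator.comparingInt(off -> keyAt(buf, off)));
        for (int i = 0; i < offsets.length; i++) offsets[i] = boxed[i];
    }

    public static void main(String[] args) {
        // Two records, each 8 bytes (4-byte key + 4-byte value): keys 7 and 3.
        byte[] buf = {0,0,0,7, 0,0,0,1, 0,0,0,3, 0,0,0,2};
        int[] offsets = {0, 8};
        sortOffsets(buf, offsets);
        System.out.println(Arrays.toString(offsets)); // [8, 0] - ordered by key
    }
}
```

(The boxing here is exactly the kind of Object-array overhead discussed below; the real implementations would sort the int array directly.)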
java.util.Arrays.sort(Object[]) also uses MergeSort (note: not QuickSort), and the only difference from Hadoop's MergeSort is that it sorts an array of Objects as opposed to an array of ints. The overhead seems to be higher in the Arrays.sort case, both in memory footprint and in the time it takes. Of course, this input is the worst case - 4-byte keys - and in reality we will probably have bigger keys, and will thereby end up filling the fixed-size in-memory map buffer with far fewer keys.

In the attached table, Time is in milliseconds and Memory is in bytes. The same input was used for each run of a sort algorithm: for example, for 1000 records, the input that produced the 1st result for MergeSort (Time: 26, Memory: 229008) also produced the 1st results for QuickSort (Time: 33, Memory: 229008) and Arrays.sort (Time: 27, Memory: 243016). I ran each algorithm with 5 sample inputs for each category of #records; the records are IntWritable key/value pairs.
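This is not the attached benchmark code, but a minimal sketch of the methodology described above - a fixed seed so every algorithm sees the same input, with the time measured around the sort call:

```java
import java.util.Arrays;
import java.util.Random;

public class BenchSketch {
    public static void main(String[] args) {
        int n = 1000;
        // Fixed seed: each algorithm under test gets an identical input.
        int[] input = new Random(42).ints(n).toArray();

        int[] copy = input.clone();              // fresh copy per algorithm
        long t0 = System.currentTimeMillis();
        Arrays.sort(copy);                       // stand-in for MergeSort/QuickSort/...
        long elapsed = System.currentTimeMillis() - t0;
        System.out.println("Time: " + elapsed + " ms for " + n + " records");
        // Memory in the real benchmarks would be sampled separately, e.g. via
        // Runtime.getRuntime().totalMemory() - freeMemory() around the run.
    }
}
```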
Also attached is an interface called SorterBase (it will be a package-private interface). The way I see things, we (Hadoop developers) would have an implementation of the interface - call it SorterBaseImpl1 - written according to the design spec on HADOOP-331 (in which the sort data structures are all arrays of ints). SorterBaseImpl1 would implement all the methods of the interface except sort(), which would be an empty method. Classes MergeSort, QuickSort, and HybridSort would extend SorterBaseImpl1 and implement the sort() method. All of these algorithms would access the base class's data structures, and since they work with very similar data structures they can extend the one base class, SorterBaseImpl1.
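A sketch of how that hierarchy could look (the method names are my guesses - the real interface is the attached SorterBase - and the insertion sort below is only a placeholder standing in for the real algorithm):

```java
interface SorterBase {
    void setInput(byte[] buffer, int[] offsets, int count);
    void sort();
    int[] getSortedOffsets();
}

// Owns the int-array data structures; sort() is intentionally empty.
class SorterBaseImpl1 implements SorterBase {
    protected byte[] buffer;   // serialized key/value pairs
    protected int[] offsets;   // offsets of record starts in 'buffer'
    protected int count;

    public void setInput(byte[] buffer, int[] offsets, int count) {
        this.buffer = buffer; this.offsets = offsets; this.count = count;
    }
    public void sort() { /* empty; subclasses override */ }
    public int[] getSortedOffsets() { return offsets; }

    // 4-byte big-endian key at 'off' (IntWritable layout).
    protected int keyAt(int off) {
        return ((buffer[off] & 0xff) << 24) | ((buffer[off + 1] & 0xff) << 16)
             | ((buffer[off + 2] & 0xff) << 8) | (buffer[off + 3] & 0xff);
    }
}

// Each algorithm extends the shared base and overrides only sort().
public class QuickSort extends SorterBaseImpl1 {
    @Override public void sort() {
        // Placeholder insertion sort over offsets[0..count), comparing keys.
        for (int i = 1; i < count; i++) {
            int off = offsets[i], j = i - 1;
            while (j >= 0 && keyAt(offsets[j]) > keyAt(off)) {
                offsets[j + 1] = offsets[j]; j--;
            }
            offsets[j + 1] = off;
        }
    }
}
```

MergeSort and HybridSort would look the same from the outside: same base class, same data structures, a different sort() body.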
If we want Java-object-based sorting (like java.util.Arrays.sort()), then we need to implement the interface in such a way that the data structures are created accordingly (for example, an array of offset "objects" as opposed to int arrays). If we want to integrate a C sort, then we need to implement the interface likewise (maybe it will be possible to reuse an existing implementation like SorterBaseImpl1, implement just the sort() method differently, and have JNI wrappers to get access to the Java data structures within SorterBaseImpl1). An interesting suggestion here is to look at how C sort programs could be used as-is in conjunction with streaming.

Generally speaking, the expectation from any algorithm is that it can sort an array of indirect pointers into a buffer. If an algorithm can do that, it should be fairly easy to accommodate it in this framework. Hadoop can have a
configurable item called "map.sort.class", whose value would be one of MergeSort.class, QuickSort.class, etc.; Hadoop would instantiate that class and work with it (via the methods defined in the interface). Other sort algorithms that could be looked at are STL's sort (http://www.sgi.com/tech/stl/sort.html) and an optimized QuickSort (HADOOP-287).

Thanks to Eric, Doug, Sam, and Ben for their input.
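A rough sketch of how such a configurable could be honored - the "map.sort.class" name comes from the proposal above, but the reflection-based lookup and the stand-in classes here are only my illustration, not committed code:

```java
// Minimal stand-ins for the attached interface and one implementation.
interface SorterBase { void sort(); }

class MergeSort implements SorterBase {
    public void sort() { /* ... */ }
}

public class SortFactory {
    // In Hadoop the class name would come from the job configuration,
    // e.g. conf.get("map.sort.class", "MergeSort").
    public static SorterBase getSorter(String className) throws Exception {
        return (SorterBase) Class.forName(className)
                                 .getDeclaredConstructor()
                                 .newInstance();
    }
}
```

A job could then switch algorithms just by setting map.sort.class, without the sorting framework knowing anything about the concrete implementation.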
