Here is a summary of some discussions on sort algorithms (the implementation work will be part of HADOOP-331). I did some benchmarking of the sort algorithms with respect to our needs; attached is a table containing the results. I benchmarked Hadoop's MergeSort, a QuickSort (a rather primitive one, without many optimizations, from http://svn.apache.org/repos/asf/jakarta/turbine/core/trunk/src/java/org/apache/turbine/util/QuickSort.java), and java.util.Arrays.sort. I have also attached the source code for the benchmark programs. The key/value pairs are IntWritables.

The input to any sort is an array of offsets to the beginnings of the key/value pairs in a buffer (containing 'n' key/value pairs). The output of any sort is a sorted array of offsets such that key_at_buffer[output[i]] < key_at_buffer[output[i+1]].
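As a concrete illustration of that contract, here is a minimal sketch (class and method names are mine, not from the attached code) that sorts an offset array by comparing the 4-byte big-endian keys it points at, without ever moving the records themselves:

```java
import java.util.Arrays;
import java.util.Comparator;

public class OffsetSortSketch {

    // Read a 4-byte big-endian int key starting at 'off' (IntWritable layout).
    static int keyAt(byte[] buf, int off) {
        return ((buf[off] & 0xff) << 24) | ((buf[off + 1] & 0xff) << 16)
             | ((buf[off + 2] & 0xff) << 8) | (buf[off + 3] & 0xff);
    }

    // Indirect sort: rearranges 'offsets' only; the buffer is untouched.
    static void sortOffsets(byte[] buf, int[] offsets) {
        Integer[] boxed = new Integer[offsets.length];
        for (int i = 0; i < offsets.length; i++) boxed[i] = offsets[i];
        Arrays.sort(boxed, Comparator.comparingInt(off -> keyAt(buf, off)));
        for (int i = 0; i < offsets.length; i++) offsets[i] = boxed[i];
    }

    public static void main(String[] args) {
        // Two records, each 8 bytes (4-byte key + 4-byte value): keys 7 and 3.
        byte[] buf = {0,0,0,7, 0,0,0,1, 0,0,0,3, 0,0,0,2};
        int[] offsets = {0, 8};
        sortOffsets(buf, offsets);
        System.out.println(Arrays.toString(offsets)); // [8, 0] - ordered by key
    }
}
```

(The boxing here is exactly the kind of Object-array overhead discussed below; the real implementations would sort the int array directly.)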
java.util.Arrays.sort(Object[]) also uses MergeSort (note: not QuickSort), and the only difference from Hadoop's MergeSort is that it sorts an array of Objects as opposed to an array of ints. The overhead seems to be higher in the Arrays.sort case, both in memory footprint and in the time it takes. Of course, this input is the worst case - 4-byte keys - and in reality we will probably have bigger keys, and will thereby end up filling the fixed-size in-memory map buffer with far fewer keys.

In the attached table, Time is in milliseconds and Memory is in bytes. The same input was used for each run of a sort algorithm: for example, for 1000 records, the input that produced the 1st result for MergeSort (Time: 26, Memory: 229008) also produced the 1st results for QuickSort (Time: 33, Memory: 229008) and Arrays.sort (Time: 27, Memory: 243016). I ran each algorithm with 5 sample inputs for each category of #records; the records are IntWritable key/value pairs.
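This is not the attached benchmark code, but a minimal sketch of the methodology described above - a fixed seed so every algorithm sees the same input, with the time measured around the sort call:

```java
import java.util.Arrays;
import java.util.Random;

public class BenchSketch {
    public static void main(String[] args) {
        int n = 1000;
        // Fixed seed: each algorithm under test gets an identical input.
        int[] input = new Random(42).ints(n).toArray();

        int[] copy = input.clone();              // fresh copy per algorithm
        long t0 = System.currentTimeMillis();
        Arrays.sort(copy);                       // stand-in for MergeSort/QuickSort/...
        long elapsed = System.currentTimeMillis() - t0;
        System.out.println("Time: " + elapsed + " ms for " + n + " records");
        // Memory in the real benchmarks would be sampled separately, e.g. via
        // Runtime.getRuntime().totalMemory() - freeMemory() around the run.
    }
}
```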
Also attached is an interface called SorterBase (it will be a package-private interface). The way I see things, we (Hadoop developers) would have an implementation of the interface - call it SorterBaseImpl1 - written according to the design spec on HADOOP-331 (in which the sort data structures are all arrays of ints). SorterBaseImpl1 would implement all the methods of the interface except sort(), which would be an empty method. Classes MergeSort, QuickSort, and HybridSort would extend SorterBaseImpl1 and implement the sort() method. All of these algorithms would access the base class's data structures, and since they work with very similar data structures they can extend the one base class, SorterBaseImpl1.
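A sketch of how that hierarchy could look (the method names are my guesses - the real interface is the attached SorterBase - and the insertion sort below is only a placeholder standing in for the real algorithm):

```java
interface SorterBase {
    void setInput(byte[] buffer, int[] offsets, int count);
    void sort();
    int[] getSortedOffsets();
}

// Owns the int-array data structures; sort() is intentionally empty.
class SorterBaseImpl1 implements SorterBase {
    protected byte[] buffer;   // serialized key/value pairs
    protected int[] offsets;   // offsets of record starts in 'buffer'
    protected int count;

    public void setInput(byte[] buffer, int[] offsets, int count) {
        this.buffer = buffer; this.offsets = offsets; this.count = count;
    }
    public void sort() { /* empty; subclasses override */ }
    public int[] getSortedOffsets() { return offsets; }

    // 4-byte big-endian key at 'off' (IntWritable layout).
    protected int keyAt(int off) {
        return ((buffer[off] & 0xff) << 24) | ((buffer[off + 1] & 0xff) << 16)
             | ((buffer[off + 2] & 0xff) << 8) | (buffer[off + 3] & 0xff);
    }
}

// Each algorithm extends the shared base and overrides only sort().
public class QuickSort extends SorterBaseImpl1 {
    @Override public void sort() {
        // Placeholder insertion sort over offsets[0..count), comparing keys.
        for (int i = 1; i < count; i++) {
            int off = offsets[i], j = i - 1;
            while (j >= 0 && keyAt(offsets[j]) > keyAt(off)) {
                offsets[j + 1] = offsets[j]; j--;
            }
            offsets[j + 1] = off;
        }
    }
}
```

MergeSort and HybridSort would look the same from the outside: same base class, same data structures, a different sort() body.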
If we want Java-object-based sorting (like java.util.Arrays.sort()), then we need to implement the interface in such a way that the data structures are created accordingly (for example, an array of offset "objects" as opposed to int arrays). If we want to integrate a C sort, then we need to implement the interface likewise (maybe it will be possible to reuse an existing implementation like SorterBaseImpl1, implement just the sort() method differently, and have JNI wrappers to get access to the Java data structures within SorterBaseImpl1). An interesting suggestion here is to look at how C sort programs could be used as-is in conjunction with streaming.

Generally speaking, the expectation from any algorithm is that it can sort an array of indirect pointers into a buffer. If an algorithm can do that, it should be fairly easy to accommodate it in this framework. Hadoop can have a
configurable item called "map.sort.class", whose value would be one of MergeSort.class, QuickSort.class, etc.; Hadoop would instantiate that class and work with it (via the methods defined in the interface). Other sort algorithms that could be looked at are STL's sort (http://www.sgi.com/tech/stl/sort.html) and an optimized QuickSort (HADOOP-287).

Thanks to Eric, Doug, Sam, and Ben for their input.
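A rough sketch of how such a configurable could be honored - the "map.sort.class" name comes from the proposal above, but the reflection-based lookup and the stand-in classes here are only my illustration, not committed code:

```java
// Minimal stand-ins for the attached interface and one implementation.
interface SorterBase { void sort(); }

class MergeSort implements SorterBase {
    public void sort() { /* ... */ }
}

public class SortFactory {
    // In Hadoop the class name would come from the job configuration,
    // e.g. conf.get("map.sort.class", "MergeSort").
    public static SorterBase getSorter(String className) throws Exception {
        return (SorterBase) Class.forName(className)
                                 .getDeclaredConstructor()
                                 .newInstance();
    }
}
```

A job could then switch algorithms just by setting map.sort.class, without the sorting framework knowing anything about the concrete implementation.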
