Aren't you worried that the overhead of shoving all that data through an external sort facility would outweigh any benefits from the algo?
--Chris

On Apr 26, 2011, at 8:34 AM, "Asokan, M" <maso...@syncsort.com> wrote:

> Hi All,
>
> I am submitting this notice of intent to contribute to the Hadoop community
> on behalf of Syncsort, Inc. (www.syncsort.com) an
> interface for an external sorter. Although Hadoop MR (Map/Reduce) provides
> users with pluggable InputFormat, Mapper, Partitioner, Combiner, Reducer, and
> OutputFormat it does not provide a plug-in for an external sorter. There is
> limited support to plug in a sorter class in the Map phase. The merge logic
> in the Reduce phase cannot be changed. Also, the sorting process is tightly
> coupled to the framework.
>
> The goal of our project is to decouple the sorting process and contribute a
> defined clean interface to allow developers to easily plug in external
> sorters through this interface. THIS INTERFACE WILL BE INDEPENDENT FROM
> SYNCSORT’S PROPRIETARY SOFTWARE PRODUCTS WHICH ARE NOT INTENDED TO BE
> CONTRIBUTED.
>
> The following are some of the motivating factors for this project (not in any
> order of significance):
>
> · An external sort plug-in will promote innovative implementations by
>   developers who have expertise in sort algorithms.
> · Hadoop developers can experiment with different sort
>   implementations (in both the Map and Reduce phases) without modifying the
>   framework code.
> · An external implementation of sort can be very well optimized to
>   take advantage of OS and hardware architecture compared to the pure Java
>   implementation in Hadoop.
> · The Hadoop implementation of sort is not self tuning. Users may be
>   overwhelmed by so many parameters to be specified to tune the performance of
>   sort.
> · One of the top memory consumers in the MR child JVMs is the sort.
>   Users are advised to set a reasonably high value for -mx argument to JVM.
>   Failure to do so will result in job termination. If the external sorter is
>   implemented as a subprocess, it can adjust its memory usage automatically and
>   make sure that it does not fail. Besides, the memory needed by the MR child
>   JVM can be reduced to a meager 128 MB.
> · The performance of Hadoop sort may be at the mercy of JVM. See
>   LUCENE-2504 in Hadoop Jira for a related performance regression issue. An
>   external sorter implemented in C or C++ and run as a subprocess will not
>   suffer from these types of problems.
> · ETL tool vendors can complement Hadoop's strengths namely HDFS, job
>   scheduling, restartability, etc. with their sort technologies. This will
>   enable Hadoop to make inroads into IT shops that use traditional ETL tools.
>
> The goals of this project are:
>
> · The primary goal of this project is to allow users to seamlessly
>   plug in the external sorter to their existing MR applications. This is in
>   contrast to the approach taken by HCE (see MAPREDUCE-1270 in Hadoop Jira)
>   which requires users to code their MR applications in C++.
> · A secondary goal is to enable users of existing ETL tools to
>   exploit Hadoop's distributed processing framework.
>
> We are confident there will be interest in this contribution to the code to
> the Hadoop community. I intend to provide a reference implementation of the
> interfaces defined in the design. This reference implementation uses GNU sort
> command to do the sorting of text data.
>
> -- Asokan
>
> M. Asokan
> Technology Architect – Data Integration
>
> Syncsort Incorporated
> 50 Tice Boulevard, Woodcliff Lake, NJ 07677
> P: 201-930-8226 | F: 201-930-8281
> E: maso...@syncsort.com
> www.syncsort.com
>
> Rethink the economics of data
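
For readers following the thread, here is a rough sketch of the two ideas in play: a small plug-in contract for the sort step, and one implementation that hands text records to the GNU sort command in a subprocess, as the quoted proposal says its reference implementation will. This is an illustration only, written in Python for brevity. It is not Hadoop's API and not the proposed interface; the names ExternalSorter, InProcessSorter, GnuSortSubprocessSorter and run_sort_phase are all made up for this example. It assumes newline-free text records and a "sort" binary on the PATH.

```python
import os
import subprocess
from typing import Iterable, List, Protocol


class ExternalSorter(Protocol):
    """Hypothetical plug-in contract: take text records, return them sorted."""

    def sort(self, records: Iterable[str]) -> List[str]:
        ...


class InProcessSorter:
    """Baseline: sort inside the calling process (stands in for a built-in sort)."""

    def sort(self, records: Iterable[str]) -> List[str]:
        return sorted(records)


class GnuSortSubprocessSorter:
    """Pipes records through the `sort` command running as a child process."""

    def __init__(self, sort_cmd: str = "sort") -> None:
        # Assumption: a `sort` binary is available on PATH.
        self.sort_cmd = sort_cmd

    def sort(self, records: Iterable[str]) -> List[str]:
        # Every record is serialized, written to the child's stdin, and read
        # back from its stdout. This round trip is the overhead in question.
        payload = "".join(record + "\n" for record in records)
        # LC_ALL=C forces plain byte-order comparison, independent of locale.
        env = dict(os.environ, LC_ALL="C")
        result = subprocess.run(
            [self.sort_cmd],
            input=payload,
            capture_output=True,
            text=True,
            env=env,
            check=True,
        )
        return result.stdout.splitlines()


def run_sort_phase(records: Iterable[str], sorter: ExternalSorter) -> List[str]:
    """The caller depends only on the contract, so sorters can be swapped."""
    return sorter.sort(records)
```

Swapping InProcessSorter for GnuSortSubprocessSorter changes nothing for the caller, which is the decoupling the proposal argues for. The subprocess version pays for serializing and piping every record, so whether it comes out ahead would have to be measured on real workloads.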