Hey Asokan, Could you please file a JIRA with your proposed enhancement so that the discussion can be archived there? See http://wiki.apache.org/hadoop/HowToContribute for more details on how to contribute to Hadoop.
Thanks, Jeff On Tue, Apr 26, 2011 at 9:46 AM, Asokan, M <maso...@syncsort.com> wrote: > Hi Chris, > The overall elapsed time to run a sort depends on many factors other than > the sort algorithm. If you follow the data flow in MR from the point where > sorting starts in Map phase to the point where <Key, Value> pairs are > available for reduction in Reduce phase there are CPU and IO intensive > activities happening. You are right, passing data to an external process > adds CPU cycles. However, a well engineered implementation of the overall > process can cut down the elapsed time. From some of my experiments with a > prototype implementation, I was able to cut down the elapsed time by about > 40% to run some huge sorts(500 GB) on a modest cluster of 6 nodes. > > Besides, an external sorter can provide additional functionalities to > Hadoop. For example, on the Map side, an external sorter process can > support filtering, reformatting, and aggregation in a single process with > performance optimized for a multicore system. With the current MR > framework, filtering and reformatting happen before sorting and all these > operations are very sequential in nature. On the Reduce side, an external > sorter can offer even exotic solution like Join since the external sorter > implementation on the Reduce side is free to work on more than one > stream(one from Hadoop MR shuffled data and the other from HDFS for > example.) > > Thank you very much for your feedback. If you any more questions, please > let me know. > > -- Asokan > > On 04/26/2011 11:41 AM, Christopher Smith wrote: > > Aren't you worried that the overhead of shoving all that data through an > external sort facility would outweigh any benefits from the algo? > > --Chris > > On Apr 26, 2011, at 8:34 AM, "Asokan, M" <maso...@syncsort.com><mailto: > maso...@syncsort.com> wrote: > > > > Hi All, > > I am submitting this notice of intent to contribute to the Hadoop community > on behalf of Syncsort, Inc. (www.syncsort.com<http://www.syncsort.com>< > http://www.syncsort.com><http://www.syncsort.com>) an interface for an > external sorter. Although Hadoop MR (Map/Reduce) provides users with > pluggable InputFormat, Mapper, Partitioner, Combiner, Reducer, and > OutputFormat it does not provide a plug-in for an external sorter. There is > limited support to plug in a sorter class in the Map phase. The merge logic > in the Reduce phase cannot be changed. Also, the sorting process is tightly > coupled to the framework. > > > > The goal of our project is to decouple the sorting process and contribute a > defined clean interface to allow developers to easily plug in external > sorters through this interface. THIS INTERFACE WILL BE INDEPENDENT FROM > SYNCSORT’S PROPRIETARY SOFTWARE PRODUCTS WHICH ARE NOT INTENDED TO BE > CONTRIBUTED. > > The following are some of the motivating factors for this project (not in > any order of significance): > · An external sort plug-in will promote innovative implementations > by developers who have expertise in sort algorithms. > · Hadoop developers can experiment with different sort > implementations (in both the Map and Reduce phases) without modifying the > framework code. > · An external implementation of sort can be very well optimized to > take advantage of OS and hardware architecture compared to the pure Java > implementation in Hadoop. > · The Hadoop implementation of sort is not self tuning. Users may > be overwhelmed by so many parameters to be specified to tune the performance > of sort. > · One of the top memory consumers in the MR child JVMs is the sort. > Users are advised to set a reasonably high value for -mx argument to JVM. > Failure to do so will result in job termination. If the external sorter is > implemented as a subprocess, it can adjust its memory usage automatically > and make sure that it does not fail. Besides, the memory needed by the MR > child JVM can be reduced to a meager 128 MB. > · The performance of Hadoop sort may be at the mercy of JVM. See > LUCENE-2504 in Hadoop Jira for a related performance regression issue. An > external sorter implemented in C or C++ and run as a subprocess will not > suffer from these types of problems. > · ETL tool vendors can complement Hadoop's strengths namely HDFS, > job scheduling, restartability, etc. with their sort technologies. This will > enable Hadoop to make inroads into IT shops that use traditional ETL tools. > The goals of this project are: > · The primary goal of this project is to allow users to seamlessly > plug in the external sorter to their existing MR applications. This is in > contrast to the approach taken by HCE (see MAPREDUCE-1270 in Hadoop Jira) > which requires users to code their MR applications in C++. > · A secondary goal is to enable users of existing ETL tools to > exploit Hadoop's distributed processing framework. > > We are confident there will be interest in this contribution to the code to > the Hadoop community. I intend to provide a reference implementation of the > interfaces defined in the design. This reference implementation uses GNU > sort command to do the sorting of text data. > > -- Asokan > > M. Asokan > Technology Architect – Data Integration > > Syncsort Incorporated > 50 Tice Boulevard, Woodcliff Lake, NJ 07677 > P: 201-930-8226 | F: 201-930-8281 > E: maso...@syncsort.com<mailto:maso...@syncsort.com><mailto:% > 20maso...@syncsort.com><mailto:%20maso...@syncsort.com> > www.syncsort.com<http://www.syncsort.com><http://www.syncsort.com/>< > http://www.syncsort.com/> > > Rethink the economics of data > ________________ > > > > ________________________________ > > > ATTENTION: ----- > > The information contained in this message (including any files transmitted > with this message) may contain proprietary, trade secret or other > confidential and/or legally privileged information. Any pricing information > contained in this message or in any files transmitted with this message is > always confidential and cannot be shared with any third parties without > prior written approval from Syncsort. This message is intended to be read > only by the individual or entity to whom it is addressed or by their > designee. If the reader of this message is not the intended recipient, you > are on notice that any use, disclosure, copying or distribution of this > message, in any form, is strictly prohibited. If you have received this > message in error, please immediately notify the sender and/or Syncsort and > destroy all copies of this message in your possession, custody or control. > >