Aren't you worried that the overhead of shoving all that data through an 
external sort facility would outweigh any benefits from the algo?

--Chris

On Apr 26, 2011, at 8:34 AM, "Asokan, M" <maso...@syncsort.com> wrote:

> Hi All,
> 
> I am submitting this notice of intent to contribute to the Hadoop community 
> on behalf of Syncsort, Inc. (www.syncsort.com<http://www.syncsort.com>) an 
> interface for an external sorter.  Although Hadoop MR (Map/Reduce) provides 
> users with pluggable InputFormat, Mapper, Partitioner, Combiner, Reducer, and 
> OutputFormat it does not provide a plug-in for an external sorter. There is 
> limited support to plug in a sorter class in the Map phase.  The merge logic 
> in the Reduce phase cannot be changed.  Also, the sorting process is tightly 
> coupled to the framework.
> 
> 
> 
> The goal of our project is to decouple the sorting process and contribute a 
> defined clean interface to allow developers to easily plug in external 
> sorters through this interface.  THIS INTERFACE WILL BE INDEPENDENT FROM 
> SYNCSORT’S PROPRIETARY SOFTWARE PRODUCTS WHICH ARE NOT INTENDED TO BE 
> CONTRIBUTED.
> 
> The following are some of the motivating factors for this project (not in any 
> order of significance):
> ·         An external sort plug-in will promote innovative implementations by 
> developers who have expertise in sort algorithms.
> ·         Hadoop developers can experiment with different sort 
> implementations (in both the Map and Reduce phases) without modifying the 
> framework code.
> ·         An external implementation of sort can be very well optimized to 
> take advantage of OS and hardware architecture compared to the pure Java 
> implementation in Hadoop.
> ·         The Hadoop implementation of sort is not self tuning. Users may be 
> overwhelmed by so many parameters to be specified to tune the performance of 
> sort.
> ·         One of the top memory consumers in the MR child JVMs is the sort.  
> Users are advised to set a reasonably high value for -mx argument to JVM. 
> Failure to do so will result in job termination. If the external sorter is 
> implemented as a subprocess, it can adjust its memory usage automatically and 
> make sure that it does not fail. Besides, the memory needed by the MR child 
> JVM can be reduced to a meager 128 MB.
> ·         The performance of Hadoop sort may be at the mercy of JVM. See 
> LUCENE-2504 in Hadoop Jira for a related performance regression issue. An 
> external sorter implemented in C or C++ and run as a subprocess will not 
> suffer from these types of problems.
> ·         ETL tool vendors can complement Hadoop's strengths namely HDFS, job 
> scheduling, restartability, etc. with their sort technologies. This will 
> enable Hadoop to make inroads into IT shops that use traditional ETL tools.
> The goals of this project are:
> ·         The primary goal of this project is to allow users to seamlessly 
> plug in the external sorter to their existing MR applications. This is in 
> contrast to the approach taken by HCE (see MAPREDUCE-1270 in Hadoop Jira) 
> which requires users to code their MR applications in C++.
> ·         A secondary goal is to enable users of existing ETL tools to 
> exploit Hadoop's distributed processing framework.
> 
> We are confident there will be interest in this contribution to the code to 
> the Hadoop community. I intend to provide a reference implementation of the 
> interfaces defined in the design. This reference implementation uses GNU sort 
> command to do the sorting of text data.
> 
> -- Asokan
> 
> M. Asokan
> Technology Architect – Data Integration
> 
> Syncsort Incorporated
> 50 Tice Boulevard, Woodcliff Lake, NJ 07677
> P: 201-930-8226 | F: 201-930-8281
> E: maso...@syncsort.com<mailto:%20maso...@syncsort.com>
> www.syncsort.com<http://www.syncsort.com/>
> 
> Rethink the economics of data
> ________________
> 
> 
> 
> ________________________________
> 
> 
> ATTENTION: -----
> 
> The information contained in this message (including any files transmitted 
> with this message) may contain proprietary, trade secret or other 
> confidential and/or legally privileged information. Any pricing information 
> contained in this message or in any files transmitted with this message is 
> always confidential and cannot be shared with any third parties without prior 
> written approval from Syncsort. This message is intended to be read only by 
> the individual or entity to whom it is addressed or by their designee. If the 
> reader of this message is not the intended recipient, you are on notice that 
> any use, disclosure, copying or distribution of this message, in any form, is 
> strictly prohibited. If you have received this message in error, please 
> immediately notify the sender and/or Syncsort and destroy all copies of this 
> message in your possession, custody or control.

Reply via email to