Hi All,

On behalf of Syncsort, Inc. (http://www.syncsort.com), I am submitting this 
notice of intent to contribute an interface for an external sorter to the 
Hadoop community.  Although Hadoop MR (Map/Reduce) provides users with 
pluggable InputFormat, Mapper, Partitioner, Combiner, Reducer, and 
OutputFormat, it does not provide a plug-in for an external sorter. There is 
only limited support for plugging in a sorter class in the Map phase, and the 
merge logic in the Reduce phase cannot be changed.  In addition, the sorting 
process is tightly coupled to the framework.



The goal of our project is to decouple the sorting process from the framework 
and contribute a clean, well-defined interface through which developers can 
easily plug in external sorters.  THIS INTERFACE WILL BE INDEPENDENT FROM SYNCSORT’S 
PROPRIETARY SOFTWARE PRODUCTS WHICH ARE NOT INTENDED TO BE CONTRIBUTED.

The following are some of the motivating factors for this project (in no 
particular order):
· An external sort plug-in will promote innovative implementations by 
developers who have expertise in sort algorithms.
· Hadoop developers can experiment with different sort implementations 
(in both the Map and Reduce phases) without modifying the framework code.
· An external implementation of sort can be optimized to take advantage of 
the OS and hardware architecture more aggressively than the pure Java 
implementation in Hadoop.
· The Hadoop implementation of sort is not self-tuning. Users may be 
overwhelmed by the number of parameters that must be specified to tune sort 
performance.
· One of the top memory consumers in the MR child JVMs is the sort. 
Users are advised to set a reasonably high maximum heap size (the JVM's -Xmx 
argument); failure to do so will result in job termination. If the external 
sorter is implemented as a subprocess, it can adjust its memory usage 
automatically and make sure that it does not fail. In addition, the memory 
needed by the MR child JVM can then be reduced to as little as 128 MB.
· The performance of Hadoop sort may be at the mercy of the JVM. See 
LUCENE-2504 in the Apache Jira for a related performance regression issue. An 
external sorter implemented in C or C++ and run as a subprocess will not suffer 
from these types of problems.
· ETL tool vendors can complement Hadoop's strengths (namely HDFS, job 
scheduling, restartability, etc.) with their sort technologies. This will enable 
Hadoop to make inroads into IT shops that use traditional ETL tools.

The goals of this project are:
· The primary goal is to allow users to seamlessly plug an external sorter 
into their existing MR applications. This is in contrast to the approach 
taken by HCE (see MAPREDUCE-1270 in Hadoop Jira), which requires users to 
code their MR applications in C++.
· A secondary goal is to enable users of existing ETL tools to exploit 
Hadoop's distributed processing framework.

We are confident that the Hadoop community will be interested in this 
contribution. I intend to provide a reference implementation of the 
interfaces defined in the design. This reference implementation uses the GNU 
sort command to sort text data.
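
To make the idea concrete, here is a minimal sketch in Java of what a sorter 
plug-in point backed by a subprocess might look like. It is an illustration 
only: the names ExternalSorter, InMemorySorter and SubprocessSorter are 
hypothetical and are not part of Hadoop or of the interfaces in the proposed 
design. The sketch assumes newline-free text records and a POSIX/GNU "sort" 
command on the PATH.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical plug-in point (NOT an actual Hadoop interface). */
interface ExternalSorter {
    List<String> sort(List<String> records) throws IOException, InterruptedException;
}

/** In-JVM baseline that sorts a copy of the records by String natural order. */
class InMemorySorter implements ExternalSorter {
    @Override
    public List<String> sort(List<String> records) {
        List<String> copy = new ArrayList<>(records);
        Collections.sort(copy);
        return copy;
    }
}

/** Delegates to the "sort" command run as a subprocess (assumed to be on the PATH). */
class SubprocessSorter implements ExternalSorter {
    @Override
    public List<String> sort(List<String> records) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder("sort");
        // Byte-order collation; agrees with InMemorySorter for ASCII text.
        pb.environment().put("LC_ALL", "C");
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process p = pb.start();

        // Feed stdin from a separate thread so that reading stdout cannot deadlock.
        Thread writer = new Thread(() -> {
            try (BufferedWriter w = new BufferedWriter(
                    new OutputStreamWriter(p.getOutputStream(), StandardCharsets.UTF_8))) {
                for (String r : records) {
                    w.write(r);   // assumption: records contain no newline characters
                    w.write('\n');
                }
            } catch (IOException e) {
                // The subprocess exited early; the exit-status check below reports it.
            }
        });
        writer.start();

        List<String> sorted = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                sorted.add(line);
            }
        }
        writer.join();
        int status = p.waitFor();
        if (status != 0) {
            throw new IOException("sort exited with status " + status);
        }
        return sorted;
    }
}
```

Because both classes implement the same interface, an application written 
against ExternalSorter could switch between them without other code changes. 
In the subprocess variant, the memory used for sorting belongs to the external 
process and not to the JVM heap.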

-- Asokan

M. Asokan
Technology Architect – Data Integration

Syncsort Incorporated
50 Tice Boulevard, Woodcliff Lake, NJ 07677
P: 201-930-8226 | F: 201-930-8281
E: maso...@syncsort.com
www.syncsort.com

Rethink the economics of data