Thanks Jeff.  I filed 
HADOOP-7242(https://issues.apache.org/jira/browse/HADOOP-7242) for this 
enhancement.

-- Asokan

On 04/26/2011 02:00 PM, Jeff Hammerbacher wrote:

Hey Asokan,

Could you please file a JIRA with your proposed enhancement so that the
discussion can be archived there? See
http://wiki.apache.org/hadoop/HowToContribute for more details on how to
contribute to Hadoop.

Thanks,
Jeff

On Tue, Apr 26, 2011 at 9:46 AM, Asokan, M 
<maso...@syncsort.com><mailto:maso...@syncsort.com> wrote:



Hi Chris,
  The overall elapsed time to run a sort depends on many factors other than
the sort algorithm.  If you follow the data flow in MR from the point where
sorting starts in Map phase to the point where <Key, Value> pairs are
available for reduction in Reduce phase there are CPU and IO intensive
activities happening.  You are right, passing data to an external process
adds CPU cycles.  However, a well engineered implementation of the overall
process can cut down the elapsed time.  From some of my experiments with a
prototype implementation, I was able to cut down the elapsed time by about
40% to run some huge sorts(500 GB) on a modest cluster of 6 nodes.

Besides, an external sorter can provide additional functionalities to
Hadoop.  For example, on the Map side, an external sorter process can
support filtering, reformatting, and aggregation in a single process with
performance optimized for a multicore system.  With the current MR
framework, filtering and reformatting happen before sorting and all these
operations are very sequential in nature.  On the Reduce side, an external
sorter can offer even exotic solution like Join since the external sorter
implementation on the Reduce side is free to work on more than one
stream(one from Hadoop MR shuffled data and the other from HDFS for
example.)

Thank you very much for your feedback.  If you any more questions, please
let me know.

-- Asokan

On 04/26/2011 11:41 AM, Christopher Smith wrote:

Aren't you worried that the overhead of shoving all that data through an
external sort facility would outweigh any benefits from the algo?

--Chris

On Apr 26, 2011, at 8:34 AM, "Asokan, M" 
<maso...@syncsort.com><mailto:maso...@syncsort.com><mailto:
maso...@syncsort.com><mailto:maso...@syncsort.com> wrote:



Hi All,

I am submitting this notice of intent to contribute to the Hadoop community
on behalf of Syncsort, Inc. 
(www.syncsort.com<http://www.syncsort.com><http://www.syncsort.com><http://www.syncsort.com><
http://www.syncsort.com><http://www.syncsort.com><http://www.syncsort.com><http://www.syncsort.com>)
 an interface for an
external sorter.  Although Hadoop MR (Map/Reduce) provides users with
pluggable InputFormat, Mapper, Partitioner, Combiner, Reducer, and
OutputFormat it does not provide a plug-in for an external sorter. There is
limited support to plug in a sorter class in the Map phase.  The merge logic
in the Reduce phase cannot be changed.  Also, the sorting process is tightly
coupled to the framework.



The goal of our project is to decouple the sorting process and contribute a
defined clean interface to allow developers to easily plug in external
sorters through this interface.  THIS INTERFACE WILL BE INDEPENDENT FROM
SYNCSORT’S PROPRIETARY SOFTWARE PRODUCTS WHICH ARE NOT INTENDED TO BE
CONTRIBUTED.

The following are some of the motivating factors for this project (not in
any order of significance):
·         An external sort plug-in will promote innovative implementations
by developers who have expertise in sort algorithms.
·         Hadoop developers can experiment with different sort
implementations (in both the Map and Reduce phases) without modifying the
framework code.
·         An external implementation of sort can be very well optimized to
take advantage of OS and hardware architecture compared to the pure Java
implementation in Hadoop.
·         The Hadoop implementation of sort is not self tuning. Users may
be overwhelmed by so many parameters to be specified to tune the performance
of sort.
·         One of the top memory consumers in the MR child JVMs is the sort.
 Users are advised to set a reasonably high value for -mx argument to JVM.
Failure to do so will result in job termination. If the external sorter is
implemented as a subprocess, it can adjust its memory usage automatically
and make sure that it does not fail. Besides, the memory needed by the MR
child JVM can be reduced to a meager 128 MB.
·         The performance of Hadoop sort may be at the mercy of JVM. See
LUCENE-2504 in Hadoop Jira for a related performance regression issue. An
external sorter implemented in C or C++ and run as a subprocess will not
suffer from these types of problems.
·         ETL tool vendors can complement Hadoop's strengths namely HDFS,
job scheduling, restartability, etc. with their sort technologies. This will
enable Hadoop to make inroads into IT shops that use traditional ETL tools.
The goals of this project are:
·         The primary goal of this project is to allow users to seamlessly
plug in the external sorter to their existing MR applications. This is in
contrast to the approach taken by HCE (see MAPREDUCE-1270 in Hadoop Jira)
which requires users to code their MR applications in C++.
·         A secondary goal is to enable users of existing ETL tools to
exploit Hadoop's distributed processing framework.

We are confident there will be interest in this contribution to the code to
the Hadoop community. I intend to provide a reference implementation of the
interfaces defined in the design. This reference implementation uses GNU
sort command to do the sorting of text data.

-- Asokan

M. Asokan
Technology Architect – Data Integration

Syncsort Incorporated
50 Tice Boulevard, Woodcliff Lake, NJ 07677
P: 201-930-8226 | F: 201-930-8281
E: 
maso...@syncsort.com<mailto:maso...@syncsort.com><mailto:maso...@syncsort.com><mailto:maso...@syncsort.com><mailto:%
20maso...@syncsort.com><mailto:%20maso...@syncsort.com><mailto:%20maso...@syncsort.com><mailto:%20maso...@syncsort.com>
www.syncsort.com<http://www.syncsort.com><http://www.syncsort.com><http://www.syncsort.com><http://www.syncsort.com/><http://www.syncsort.com/><
http://www.syncsort.com/><http://www.syncsort.com/>

Rethink the economics of data
________________



________________________________


ATTENTION: -----

The information contained in this message (including any files transmitted
with this message) may contain proprietary, trade secret or other
confidential and/or legally privileged information. Any pricing information
contained in this message or in any files transmitted with this message is
always confidential and cannot be shared with any third parties without
prior written approval from Syncsort. This message is intended to be read
only by the individual or entity to whom it is addressed or by their
designee. If the reader of this message is not the intended recipient, you
are on notice that any use, disclosure, copying or distribution of this
message, in any form, is strictly prohibited. If you have received this
message in error, please immediately notify the sender and/or Syncsort and
destroy all copies of this message in your possession, custody or control.




.



________________________________


ATTENTION: -----

The information contained in this message (including any files transmitted with 
this message) may contain proprietary, trade secret or other confidential 
and/or legally privileged information. Any pricing information contained in 
this message or in any files transmitted with this message is always 
confidential and cannot be shared with any third parties without prior written 
approval from Syncsort. This message is intended to be read only by the 
individual or entity to whom it is addressed or by their designee. If the 
reader of this message is not the intended recipient, you are on notice that 
any use, disclosure, copying or distribution of this message, in any form, is 
strictly prohibited. If you have received this message in error, please 
immediately notify the sender and/or Syncsort and destroy all copies of this 
message in your possession, custody or control.

Reply via email to