[
https://issues.apache.org/jira/browse/MAPREDUCE-4039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13291599#comment-13291599
]
longfei commented on MAPREDUCE-4039:
------------------------------------
The patch Did not give the class org.apache.hadoop.util.IndexedCountingSortable.
> Sort Avoidance
> --------------
>
> Key: MAPREDUCE-4039
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4039
> Project: Hadoop Map/Reduce
> Issue Type: New Feature
> Components: mrv2
> Affects Versions: 0.23.2
> Reporter: anty.rao
> Assignee: anty
> Priority: Minor
> Fix For: 0.23.2
>
> Attachments: MAPREDUCE-4039-branch-0.23.2.patch,
> MAPREDUCE-4039-branch-0.23.2.patch
>
>
> Inspired by
> [Tenzing|http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/37200.pdf],
> in 5.1 MapReduce Enhanceemtns:
> {quote}*Sort Avoidance*. Certain operators such as hash join
> and hash aggregation require shuffling, but not sorting. The
> MapReduce API was enhanced to automatically turn off
> sorting for these operations. When sorting is turned off, the
> mapper feeds data to the reducer which directly passes the
> data to the Reduce() function bypassing the intermediate
> sorting step. This makes many SQL operators significantly
> more ecient.{quote}
> There are a lot of applications which need aggregation only, not
> sorting.Using sorting to achieve aggregation is costly and inefficient.
> Without sorting, up application can make use of hash table or hash map to do
> aggregation efficiently.But application should bear in mind that reduce
> memory is limited, itself is committed to manage memory of reduce, guard
> against out of memory. Map-side combiner is not supported, you can also do
> hash aggregation in map side as a workaround.
> the following is the main points of sort avoidance implementation
> # add a configuration parameter ??mapreduce.sort.avoidance??, boolean type,
> to turn on/off sort avoidance workflow.Two type of workflow are coexist
> together.
> # key/value pairs emitted by map function is sorted by partition only, using
> a more efficient sorting algorithm: counting sort.
> # map-side merge, use a kind of byte merge, which just concatenate bytes from
> generated spills, read in bytes, write out bytes, without overhead of
> key/value serialization/deserailization, comparison, which current version
> incurs.
> # reduce can start up as soon as there is any map output available, in
> contrast to sort workflow which must wait until all map outputs are fetched
> and merged.
> # map output in memory can be directly consumed by reduce.When reduce can't
> catch up with the speed of incoming map outputs, in-memory merge thread will
> kick in, merging in-memory map outputs onto disk.
> # sequentially read in on-disk files to feed reduce, in contrast to currently
> implementation which read multiple files concurrently, result in many disk
> seek. Map output in memory take precedence over on disk files in feeding
> reduce function.
> I have already implement this feature based on hadoop CDH3U3 and done some
> performance evaluation, you can reference to
> [https://github.com/hanborq/hadoop] for details. Now,I'm willing to port it
> into yarn. Welcome for commenting.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira