[
https://issues.apache.org/jira/browse/MAPREDUCE-4039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445328#comment-13445328
]
Mariappan Asokan commented on MAPREDUCE-4039:
---------------------------------------------
Hi Anty,
Sorry, I did not get back to you. Please take a look at the patch for
MAPREDUCE-2454. I added a test that has a very simple implementation of
NullSortPlugin. You can take a look at the code there.
-- Asokan
> Sort Avoidance
> --------------
>
> Key: MAPREDUCE-4039
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4039
> Project: Hadoop Map/Reduce
> Issue Type: New Feature
> Components: mrv2
> Affects Versions: 0.23.2
> Reporter: anty.rao
> Assignee: anty
> Priority: Minor
> Fix For: 0.23.2
>
> Attachments: IndexedCountingSortable.java,
> MAPREDUCE-4039-branch-0.23.2.patch, MAPREDUCE-4039-branch-0.23.2.patch,
> MAPREDUCE-4039-branch-0.23.2.patch
>
>
> Inspired by
> [Tenzing|http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/37200.pdf],
> in 5.1 MapReduce Enhancements:
> {quote}*Sort Avoidance*. Certain operators such as hash join
> and hash aggregation require shuffling, but not sorting. The
> MapReduce API was enhanced to automatically turn off
> sorting for these operations. When sorting is turned off, the
> mapper feeds data to the reducer which directly passes the
> data to the Reduce() function bypassing the intermediate
> sorting step. This makes many SQL operators significantly
> more efficient.{quote}
> Many applications need only aggregation, not sorting. Using sorting to
> achieve aggregation is costly and inefficient. Without sorting, an
> application can use a hash table or hash map to aggregate efficiently.
> However, the application should bear in mind that reduce-side memory is
> limited; it is responsible for managing its own memory in the reducer and
> guarding against out-of-memory errors. A map-side combiner is not supported,
> but you can do hash aggregation on the map side as a workaround.
> The following are the main points of the sort avoidance implementation:
> # Add a configuration parameter ??mapreduce.sort.avoidance??, of boolean type,
> to turn the sort avoidance workflow on or off. The two workflows coexist.
> # Key/value pairs emitted by the map function are sorted by partition only,
> using a more efficient sorting algorithm: counting sort.
> # The map-side merge uses a kind of byte merge that just concatenates bytes
> from the generated spills (read bytes in, write bytes out), without the
> overhead of key/value serialization/deserialization and comparison that the
> current version incurs.
> # Reduce can start as soon as any map output is available, in contrast to
> the sort workflow, which must wait until all map outputs are fetched and
> merged.
> # Map output in memory can be consumed directly by reduce. When reduce can't
> keep up with the speed of incoming map outputs, the in-memory merge thread
> kicks in, merging in-memory map outputs onto disk.
> # On-disk files are read sequentially to feed reduce, in contrast to the
> current implementation, which reads multiple files concurrently, resulting
> in many disk seeks. Map output in memory takes precedence over on-disk files
> in feeding the reduce function.
> I have already implemented this feature on Hadoop CDH3U3 and done some
> performance evaluation; see
> [https://github.com/hanborq/hadoop] for details. Now I would like to port it
> to YARN. Comments are welcome.
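
For readers unfamiliar with the "sorted by partition only ... counting sort" step in the description above, here is a minimal sketch of the idea. It is not taken from the attached patch or from IndexedCountingSortable.java; the class name PartitionCountingSort and the method sortByPartition are hypothetical, and it assumes each buffered record already carries an integer partition id in the range [0, numPartitions).

```java
import java.util.Arrays;

// Hypothetical illustration only; not the code from the MAPREDUCE-4039 patch.
public class PartitionCountingSort {

    /**
     * Returns record indices grouped by partition id in ascending partition
     * order. Keys are never compared, and records in the same partition keep
     * their original relative order. Runs in O(records + partitions).
     */
    public static int[] sortByPartition(int[] partitionOf, int numPartitions) {
        // start[p + 1] first holds the number of records in partition p.
        int[] start = new int[numPartitions + 1];
        for (int p : partitionOf) {
            start[p + 1]++;
        }
        // Prefix sum: start[p] becomes the first output slot for partition p.
        for (int p = 0; p < numPartitions; p++) {
            start[p + 1] += start[p];
        }
        // Place each record index into the next free slot of its partition.
        int[] order = new int[partitionOf.length];
        for (int i = 0; i < partitionOf.length; i++) {
            order[start[partitionOf[i]]++] = i;
        }
        return order;
    }

    public static void main(String[] args) {
        // Five buffered records assigned to three partitions.
        int[] partitionOf = {2, 0, 1, 0, 2};
        System.out.println(Arrays.toString(sortByPartition(partitionOf, 3)));
        // prints [1, 3, 2, 0, 4]
    }
}
```

Because only partition ids are counted, the cost does not depend on key size or on a key comparator, which is the saving the description attributes to this step.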