[jira] [Commented] (MAPREDUCE-4039) Sort Avoidance

Mariappan Asokan (JIRA) Thu, 30 Aug 2012 14:50:12 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-4039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445328#comment-13445328
 ]


Mariappan Asokan commented on MAPREDUCE-4039:
---------------------------------------------

Hi Anty,
  Sorry, I did not get back to you.  Please take a look at the patch for 
MAPREDUCE-2454.  I added a test that has a very simple implementation of 
NullSortPlugin.  You can take a look at the code there.

-- Asokan
                
> Sort Avoidance
> --------------
>
>                 Key: MAPREDUCE-4039
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4039
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: mrv2
>    Affects Versions: 0.23.2
>            Reporter: anty.rao
>            Assignee: anty
>            Priority: Minor
>             Fix For: 0.23.2
>
>         Attachments: IndexedCountingSortable.java, 
> MAPREDUCE-4039-branch-0.23.2.patch, MAPREDUCE-4039-branch-0.23.2.patch, 
> MAPREDUCE-4039-branch-0.23.2.patch
>
>
> Inspired by 
> [Tenzing|http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/37200.pdf],
>  in 5.1 MapReduce Enhanceemtns:
> {quote}*Sort Avoidance*. Certain operators such as hash join
> and hash aggregation require shuffling, but not sorting. The
> MapReduce API was enhanced to automatically turn off
> sorting for these operations. When sorting is turned off, the
> mapper feeds data to the reducer which directly passes the
> data to the Reduce() function bypassing the intermediate
> sorting step. This makes many SQL operators significantly
> more ecient.{quote}
> There are a lot of applications which need aggregation only, not 
> sorting.Using sorting to achieve aggregation is costly and inefficient. 
> Without sorting, up application can make use of hash table or hash map to do 
> aggregation efficiently.But application should bear in mind that reduce 
> memory is limited, itself is committed to manage memory of reduce, guard 
> against out of memory. Map-side combiner is not supported, you can also do 
> hash aggregation in map side  as a workaround.
> the following is the main points of sort avoidance implementation
> # add a configuration parameter ??mapreduce.sort.avoidance??, boolean type, 
> to turn on/off sort avoidance workflow.Two type of workflow are coexist 
> together.
> # key/value pairs emitted by map function is sorted by partition only, using 
> a more efficient sorting algorithm: counting sort.
> # map-side merge, use a kind of byte merge, which just concatenate bytes from 
> generated spills, read in bytes, write out bytes, without overhead of 
> key/value serialization/deserailization, comparison, which current version 
> incurs.
> # reduce can start up as soon as there is any map output available, in 
> contrast to sort workflow which must wait until all map outputs are fetched 
> and merged.
> # map output in memory can be directly consumed by reduce.When reduce can't 
> catch up with the speed of incoming map outputs, in-memory merge thread will 
> kick in, merging in-memory map outputs onto disk.
> # sequentially read in on-disk files to feed reduce, in contrast to currently 
> implementation which read multiple files concurrently, result in many disk 
> seek. Map output in memory take precedence over on disk files in feeding 
> reduce function.
> I have already implement this feature based on hadoop CDH3U3 and done some 
> performance evaluation, you can reference to 
> [https://github.com/hanborq/hadoop] for details. Now,I'm willing to port it 
> into yarn. Welcome for commenting.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-4039) Sort Avoidance

Reply via email to