[ 
https://issues.apache.org/jira/browse/HADOOP-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12649379#action_12649379
 ] 

Ravi Gummadi commented on HADOOP-2774:
--------------------------------------

So I will do the following:
I will add a new constructor to IFile.Writer (a wrapper around the existing 
constructor) that takes spilledRecordsCounter as a parameter. This constructor 
is called from MapTask. Writer will keep a long that is incremented in 
append(), and spilledRecordsCounter is updated from it in Writer.close().
Similarly, I will add a new constructor to IFile.Reader (a wrapper around the 
existing constructor) that takes spilledRecordsCounter as a parameter. This 
constructor is called from ReduceTask. Reader will keep a long that is 
incremented in next(), and spilledRecordsCounter is updated from it in 
Reader.close().
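
A minimal sketch of the Writer side of this idea, for illustration only. 
SpillCounter and WriterSketch are hypothetical stand-ins, not the real Hadoop 
Counters.Counter or IFile.Writer, and the constructor parameters are simplified 
assumptions. The Reader side would be symmetric, counting in next() instead of 
append().

```java
import java.util.List;

// Hypothetical stand-in for a Hadoop counter; only increment/get are modeled.
class SpillCounter {
    private long value;
    void increment(long n) { value += n; }
    long getValue() { return value; }
}

// Hypothetical simplified writer; the real IFile.Writer takes other parameters.
class WriterSketch {
    private final List<String> out;
    private SpillCounter spilledRecordsCounter; // stays null when no counter is given
    private long numRecordsWritten = 0;         // local tally, flushed in close()

    // Plays the role of the existing constructor: no counting context.
    WriterSketch(List<String> out) {
        this.out = out;
    }

    // New constructor: wraps the existing one and remembers the counter.
    WriterSketch(List<String> out, SpillCounter spilledRecordsCounter) {
        this(out);
        this.spilledRecordsCounter = spilledRecordsCounter;
    }

    void append(String key, String value) {
        out.add(key + "=" + value);
        numRecordsWritten++; // cheap local update on every record
    }

    void close() {
        // Push the tally to the shared counter once, at close time.
        if (spilledRecordsCounter != null) {
            spilledRecordsCounter.increment(numRecordsWritten);
        }
        numRecordsWritten = 0; // assumption: guards against double counting on a second close()
    }
}
```

Keeping a plain long in the writer and touching the counter only in close() 
means the per-record path does not depend on the counter at all, and callers 
that use the old constructor are unaffected.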

Since Merger.merge is called from both Map and Reduce, Merger has no context 
to tell whether it was called from Map or Reduce. So I will pass 2 counters 
(say readCounter and writeCounter) to merge. MapTask calls Merger.merge(/* 
other params */, null, spilledRecordsCounter) and ReduceTask calls 
Merger.merge(/* other params */, spilledRecordsCounter, null). Merger.merge() 
will call the new constructors as Reader(/* other params */, readCounter) and 
Writer(/* other params */, writeCounter), so that writes are counted in Map 
and reads are counted in Reduce.
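
The two-counter call pattern could look roughly like the sketch below. All 
class names here (RecordCounter, CountingReader, CountingWriter, MergerSketch) 
are hypothetical, the parameter lists are simplified assumptions, and the 
"merge" just concatenates segments instead of doing a real sorted merge, since 
only the counting is being illustrated.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a Hadoop counter.
class RecordCounter {
    private long value;
    void increment(long n) { value += n; }
    long getValue() { return value; }
}

// Hypothetical reader: counts records in next(), reports them in close().
class CountingReader {
    private final Iterator<String> records;
    private final RecordCounter readCounter; // null means "do not count reads"
    private long numRecordsRead = 0;

    CountingReader(List<String> segment, RecordCounter readCounter) {
        this.records = segment.iterator();
        this.readCounter = readCounter;
    }

    String next() {
        if (!records.hasNext()) return null; // end of segment
        numRecordsRead++;
        return records.next();
    }

    void close() {
        if (readCounter != null) readCounter.increment(numRecordsRead);
    }
}

// Hypothetical writer: counts records in append(), reports them in close().
class CountingWriter {
    private final List<String> out;
    private final RecordCounter writeCounter; // null means "do not count writes"
    private long numRecordsWritten = 0;

    CountingWriter(List<String> out, RecordCounter writeCounter) {
        this.out = out;
        this.writeCounter = writeCounter;
    }

    void append(String record) {
        out.add(record);
        numRecordsWritten++;
    }

    void close() {
        if (writeCounter != null) writeCounter.increment(numRecordsWritten);
    }
}

class MergerSketch {
    // merge() does not know its caller; it only forwards whichever counters it was given.
    static List<String> merge(List<List<String>> segments,
                              RecordCounter readCounter,
                              RecordCounter writeCounter) {
        List<String> merged = new ArrayList<>();
        CountingWriter writer = new CountingWriter(merged, writeCounter);
        for (List<String> segment : segments) {
            CountingReader reader = new CountingReader(segment, readCounter);
            for (String rec = reader.next(); rec != null; rec = reader.next()) {
                writer.append(rec);
            }
            reader.close();
        }
        writer.close();
        return merged;
    }
}
```

In this sketch the map side would call MergerSketch.merge(segments, null, 
spilledRecordsCounter) and the reduce side MergerSketch.merge(segments, 
spilledRecordsCounter, null), so each record that passes through a merge is 
counted once per pass rather than once as a read and again as a write.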

Thoughts?

> Add counters to show number of key/values that have been sorted and merged in 
> the maps and reduces
> --------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-2774
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2774
>             Project: Hadoop Core
>          Issue Type: Bug
>            Reporter: Owen O'Malley
>            Assignee: Ravi Gummadi
>             Fix For: 0.20.0
>
>         Attachments: HADOOP-2774.patch, HADOOP-2774.patch
>
>
> For each *pass* of the sort and merge, I would like a count of the number of 
> records. So for example, if the map output 100 records and they were sorted 
> once, the counter would be 100. If it spilled twice and was merged together, 
> it would be 200. Clearly in a multi-level merge, it may not be a multiple of 
> the number of map output records. This would let the users easily see if they 
> have values like io.sort.mb or io.sort.factor set too low.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
