[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924927#action_12924927
 ] 

Ravi Gummadi commented on MAPREDUCE-2154:
-----------------------------------------

The map output bytes counter is not reliable here: when the original job has 
a combiner, its value is wrong for the gridmix mapper. Instead of relying on 
it, I propose that gridmix's mapper do the following to emulate approximate 
disk I/O:

(1) Read mapInputBytes bytes of data first. In most cases this should satisfy 
the HDFS_BYTES_READ counter of the map task.

(2) Create a map-output-file whose size is obtained by the following 
calculation. Build an array reduceInputBytes[ ] with one entry per reduce 
task of the original job, where each entry holds the number of input bytes of 
that reduce task. Then, for each reduce task n:
{code}mapOutputSizeForReduce_n = reduceInputBytes[n] / numMaps;{code}

(3) At this point, if the simulated job's map task is behind the 
FILE_BYTES_WRITTEN counter of the original job's map task, write 
{code}FILE_BYTES_WRITTEN_Of_Original_Map_Task - 
FILE_BYTES_WRITTEN_of_current_task{code} bytes to a temporary 
localFileSystem file, then delete that temporary file.

(4) At this point, if the simulated job's map task is behind the 
FILE_BYTES_READ counter of the original job's map task, read 
{code}FILE_BYTES_READ_Of_Original_Map_Task - 
FILE_BYTES_READ_of_current_task{code} bytes from the map-output-file that was 
created in step (2).

Steps (3) and (4) could be important when the original map task had a lot of 
spills.
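
To make the arithmetic in steps (2) to (4) concrete, here is a minimal sketch 
of the calculations only. It is not Gridmix code: the class name, the method 
names, and the sample numbers are all made up for illustration, and it 
assumes the reduce input is split evenly across maps, as the formula in step 
(2) does. It performs no actual I/O.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of Gridmix. */
final class MapIoEmulationSketch {

    /** Step (2): this map's share of each reduce's input, assuming an even split across maps. */
    static long[] mapOutputSizes(long[] reduceInputBytes, int numMaps) {
        if (numMaps <= 0) {
            throw new IllegalArgumentException("numMaps must be positive");
        }
        long[] sizes = new long[reduceInputBytes.length];
        for (int n = 0; n < reduceInputBytes.length; n++) {
            // Integer division: any remainder is dropped, so the emulation is approximate.
            sizes[n] = reduceInputBytes[n] / numMaps;
        }
        return sizes;
    }

    /** Steps (3) and (4): bytes still needed to reach the original counter, or 0 if not behind. */
    static long shortfall(long originalCounter, long currentCounter) {
        return Math.max(0L, originalCounter - currentCounter);
    }

    public static void main(String[] args) {
        long[] reduceInputBytes = {1000L, 2500L, 0L}; // made-up values, one per reduce task
        int numMaps = 4;                              // made-up number of map tasks

        System.out.println(Arrays.toString(mapOutputSizes(reduceInputBytes, numMaps))); // [250, 625, 0]
        System.out.println(shortfall(5000L, 1200L)); // 3800: extra bytes to write or read
        System.out.println(shortfall(800L, 1200L));  // 0: already at or past the original counter
    }
}
```

The same shortfall calculation serves both step (3), with the 
FILE_BYTES_WRITTEN values, and step (4), with the FILE_BYTES_READ values. 
Clamping at zero reflects the "if ... is behind" condition in those steps.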


> Gridmix mapper doesn't emit the correct map output records while comparing 
> with json file.
> ------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2154
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2154
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: contrib/gridmix
>            Reporter: Vinay Kumar Thota
>            Assignee: Ranjit Mathew
>         Attachments: wordcount.json
>
>
> I ran Gridmix with a trace file and, after the job completed, compared the 
> job history information against the trace. The map output records in the 
> job history did not match the map output records in the trace file. To 
> reproduce the issue, download the attached trace file, run Gridmix with it, 
> and then compare the map output records in the job history with those in 
> the trace file.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
