[
https://issues.apache.org/jira/browse/MAPREDUCE-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924927#action_12924927
]
Ravi Gummadi commented on MAPREDUCE-2154:
-----------------------------------------
Instead of relying on the value of the map output bytes counter (which is the
wrong value for the gridmix mapper when the original job has a combiner), I
propose that gridmix's mapper do the following to emulate approximate disk I/O:
(1) Read mapInputBytes bytes of data first. This should satisfy the
HDFS_BYTES_READ counter of the map task almost all the time.
(2) Create a map-output-file whose size is obtained by the following
calculation: build an array reduceInputBytes[ ], of size equal to the number of
reduce tasks, that contains the number of input bytes of each reduce task of
the original job. Then, for each reduce task n:
{code}mapOutputSizeForReduce_n = reduceInputBytes[n] / numMaps;{code}
(3) At this point, if this simulated job's map task is behind the
FILE_BYTES_WRITTEN counter of the original job's map task, then write
{code}FILE_BYTES_WRITTEN_Of_Original_Map_Task -
FILE_BYTES_WRITTEN_of_current_task{code} bytes to a temporary
localFileSystem file. Then delete this temporary file.
(4) At this point, if this simulated job's map task is behind the
FILE_BYTES_READ counter of the original job's map task, then read
{code}FILE_BYTES_READ_Of_Original_Map_Task -
FILE_BYTES_READ_of_current_task{code} bytes from the map-output-file that was
created in step (2).
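The calculation in step (2) could be sketched roughly as below. This is only
an illustration, not gridmix code: the class and method names are hypothetical,
and summing the per-reduce portions to get the file size is an assumption that
the step implies but does not state. Integer division drops any remainder, so
the emulated total can fall short of the original by up to numMaps - 1 bytes
per reduce task.

```java
// Hypothetical helper, not part of gridmix: sizes the simulated map output.
final class MapOutputSizing {

    // Portion of this map's output destined for each reduce task n,
    // computed as reduceInputBytes[n] / numMaps (integer division).
    static long[] perReduceOutputBytes(long[] reduceInputBytes, int numMaps) {
        if (numMaps <= 0) {
            throw new IllegalArgumentException("numMaps must be positive");
        }
        long[] sizes = new long[reduceInputBytes.length];
        for (int n = 0; n < reduceInputBytes.length; n++) {
            sizes[n] = reduceInputBytes[n] / numMaps;
        }
        return sizes;
    }

    // Assumption: the map-output-file size is the sum of the portions.
    static long totalMapOutputBytes(long[] perReduceSizes) {
        long total = 0L;
        for (long size : perReduceSizes) {
            total += size;
        }
        return total;
    }
}
```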
Steps (3) and (4) could be important when the original map task had a lot of
spills.
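A minimal sketch of the catch-up logic in steps (3) and (4), under stated
assumptions: the names are hypothetical, the counter values are passed in as
plain longs rather than read from Hadoop counters, and java.nio temporary
files stand in for a localFileSystem file. The same deficit calculation would
apply to FILE_BYTES_READ in step (4), with a read from the map-output-file in
place of the write.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of gridmix.
final class CounterCatchUp {

    // Bytes still needed to reach the original task's counter value;
    // zero when the simulated task is already at or ahead of it.
    static long deficit(long originalCounter, long currentCounter) {
        return Math.max(0L, originalCounter - currentCounter);
    }

    // Step (3): write the given number of bytes to a temporary local file,
    // then delete the file. Returns the number of bytes written.
    static long writeAndDeletePadding(long bytes) throws IOException {
        Path tmp = Files.createTempFile("gridmix-pad", ".tmp");
        byte[] buffer = new byte[64 * 1024];
        long written = 0L;
        try (OutputStream out = Files.newOutputStream(tmp)) {
            while (written < bytes) {
                int chunk = (int) Math.min(buffer.length, bytes - written);
                out.write(buffer, 0, chunk);
                written += chunk;
            }
        } finally {
            // The stream is already closed here, so the file can be removed.
            Files.deleteIfExists(tmp);
        }
        return written;
    }
}
```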
> Gridmix mapper doesn't emit the correct map output records while comparing
> with json file.
> ------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-2154
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2154
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: contrib/gridmix
> Reporter: Vinay Kumar Thota
> Assignee: Ranjit Mathew
> Attachments: wordcount.json
>
>
> I ran Gridmix with a trace file and, after the job completed, compared the
> job history information against the trace. The map output records in the job
> history did not match the map output records in the trace file. To reproduce
> the issue, please download the attached trace file and run Gridmix, then
> compare the map output records in the job history with those in the trace
> file.