[ https://issues.apache.org/jira/browse/MAPREDUCE-5958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Lowe reassigned MAPREDUCE-5958:
-------------------------------------

    Assignee: Emilio Coppa

Thanks for the report and patch, Emilio. We just ran across this as well, and my apologies for missing this earlier. Looks like this was caused by MAPREDUCE-2264 when the getRawDataLength stuff was added.

Patch looks good overall, and I manually verified it fixes the issue with compressed map outputs. However, it would be nice to have a unit test to verify this doesn't break again in the future. I see there are already unit tests in TestMerger that were intended to test this, but the tests are ineffective because they create no segments. The {{for (int i = 1; i < 1; i++)}} loop in the test never executes its body, so no segments are created and nothing is merged.

[~sandyr] [~devaraj.k], could you also take a look since you were familiar with the changes in MAPREDUCE-2264?

> Wrong reduce task progress if map output is compressed
> ------------------------------------------------------
>
>                 Key: MAPREDUCE-5958
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5958
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.2.0, 2.3.0, 2.2.1, 2.4.0, 2.4.1
>            Reporter: Emilio Coppa
>            Assignee: Emilio Coppa
>            Priority: Minor
>              Labels: progress, reduce
>             Fix For: 2.4.1
>
>         Attachments: HADOOP-5958-v2.patch
>
>
> If the map output is compressed (_mapreduce.map.output.compress_ set to
> _true_) then the reduce task progress may be highly underestimated.
> In the reduce phase (but also in the merge phase), the progress of a reduce
> task is computed as the ratio between the number of processed bytes and the
> number of total bytes. But:
> - the number of total bytes is computed by summing up the uncompressed
> segment sizes (_Merger.Segment.getRawDataLength()_)
> - the number of processed bytes is computed from the position of the
> current _IFile.Reader_ (using _IFile.Reader.getPosition()_), but this may
> refer to the position in the underlying on-disk file (which may be compressed)
> Thus, if the map outputs are compressed then the progress may be
> underestimated (e.g., with only 1 on-disk map output file whose compressed
> size is 25% of its original size, the reduce task progress during the reduce
> phase will range between 0 and 0.25 and then artificially jump to 1.0).
> A patch is attached: the number of processed bytes is now computed from
> _IFile.Reader.bytesRead_ (if the reader is in memory, then
> _getPosition()_ already returns exactly this field).
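
The arithmetic in the description can be sketched in a few lines of Java. This is only an illustration, not Hadoop code: the class {{ProgressSketch}}, the method {{progress}}, and the byte counts are hypothetical, chosen to match the 25% example in the report. It assumes progress is simply processed bytes divided by total bytes.

```java
// Illustrative sketch only; names and numbers are assumptions, not Hadoop APIs.
class ProgressSketch {

    /** Progress as processedBytes / totalBytes, capped at 1.0. */
    static float progress(long processedBytes, long totalBytes) {
        if (totalBytes <= 0) {
            return 1.0f; // nothing to process, treat as complete
        }
        return Math.min(1.0f, (float) processedBytes / totalBytes);
    }

    public static void main(String[] args) {
        long rawLength = 1000;       // uncompressed segment size (the denominator)
        long compressedLength = 250; // on-disk size, 25% of the raw size

        // Numerator taken from the position in the compressed on-disk file:
        // even after the whole file is consumed, the ratio stops at 0.25.
        System.out.println(progress(compressedLength, rawLength)); // prints 0.25

        // Numerator taken from the count of uncompressed bytes read:
        // the ratio reaches 1.0 when the segment is fully consumed.
        System.out.println(progress(rawLength, rawLength)); // prints 1.0
    }
}
```

The point is that the numerator and the denominator must be measured in the same units (both uncompressed); otherwise the ratio is capped at the compression ratio.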