[ 
https://issues.apache.org/jira/browse/HADOOP-3131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated HADOOP-3131:
----------------------------------

    Affects Version/s:     (was: 0.16.1)
                       0.18.0
                       0.17.1
                       0.17.0
               Status: Patch Available  (was: Open)

The problem was that SequenceFile.Sorter.MergeQueue calculates progress as 
(total size of keys and values read) / (total size of files to be merged on 
disk). When a file is compressed, the on-disk file size is much smaller than 
the combined size of the keys and values, so the ratio can exceed 100%. In 
fact, there is also a problem when compression is turned off: the code 
reports progress below 100% because it does not count bytes in the file that 
are not part of keys and values, such as the header and length fields. This 
patch changes MergeQueue to use the position in the input stream as the 
number of bytes read from disk and to divide that by the total amount of 
data to be merged.
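To illustrate the arithmetic (this is a sketch with made-up numbers, not the
actual Hadoop code; the function names and sizes are hypothetical):

```python
def broken_progress(kv_bytes_read, on_disk_bytes):
    # Old formula: uncompressed key/value bytes read, divided by the
    # compressed on-disk size of the files being merged.
    return kv_bytes_read / on_disk_bytes

def fixed_progress(stream_position, total_on_disk_bytes):
    # Patched formula: on-disk input-stream position divided by the
    # total on-disk amount of data to be merged. Both sides measure
    # the same (compressed) bytes, so the ratio stays within [0, 1].
    return stream_position / total_on_disk_bytes

# Hypothetical BLOCK-compressed merge: 100 MB of keys/values stored
# in 40 MB on disk.
kv_read = 100
on_disk = 40
print(broken_progress(kv_read, on_disk))   # 2.5, i.e. reported as 250%
print(fixed_progress(on_disk, on_disk))    # 1.0, i.e. 100% at merge end
```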

> enabling BLOCK compression for map outputs breaks the reduce progress counters
> ------------------------------------------------------------------------------
>
>                 Key: HADOOP-3131
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3131
>             Project: Hadoop Core
>          Issue Type: Bug
>    Affects Versions: 0.17.0, 0.17.1, 0.18.0
>            Reporter: Colin Evans
>         Attachments: Picture 1.png
>
>
> Enabling map output compression and setting the compression type to BLOCK 
> causes the progress counters during the reduce to go crazy and report 
> progress counts over 100%.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.