[jira] Commented: (MAPREDUCE-2212) MapTask and ReduceTask should only compress/decompress the final map output file

Chris Douglas (JIRA) Tue, 07 Dec 2010 16:10:34 -0800

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12969085#action_12969085 ]


Chris Douglas commented on MAPREDUCE-2212:
------------------------------------------

Todd's point on disk bandwidth matches some benchmarks we ran a couple of 
years ago: compressing the intermediate data improved spill and merge times. 
It would be interesting to see whether those results hold today, and for which 
codecs.
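
One rough way to start re-checking this is to compare compressed size and 
compression time per codec. The sketch below is only illustrative: it uses 
Python's standard-library codecs (zlib, bz2, lzma) rather than Hadoop's, and 
the synthetic payload is an assumption that will not resemble real 
intermediate data. It also measures only CPU time and output size, not disk 
bandwidth.

```python
import bz2
import lzma
import time
import zlib


def measure(name, compress, payload):
    """Return (codec name, compressed size in bytes, seconds spent compressing)."""
    start = time.perf_counter()
    size = len(compress(payload))
    return name, size, time.perf_counter() - start


# Synthetic, repetitive "intermediate records"; real map output will differ.
payload = b"".join(b"key%06d\tvalue%06d\n" % (i, i % 97) for i in range(50_000))

results = [
    measure("zlib", zlib.compress, payload),
    measure("bz2", bz2.compress, payload),
    measure("lzma", lzma.compress, payload),
]

for name, size, seconds in results:
    print(f"{name}: {size} of {len(payload)} bytes, {seconds:.3f}s")
```

Whether the smaller writes pay for the extra CPU depends on the disks and the 
codec, which is exactly what a real benchmark would have to show.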

In the case where no records are collected after the soft spill, the 
intermediate output will either need to be rewritten (since the reduce 
expects compressed output) or the shuffle will need to handle a mix of 
compressed and uncompressed segments. It's a rare case, but one the framework 
would need to handle.
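
The two options can be sketched as follows. This is not Hadoop code: the 
Segment class, its per-segment "compressed" flag, and the helper names are 
all hypothetical, and zlib stands in for whatever codec is configured.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Segment:
    data: bytes
    compressed: bool  # hypothetical per-segment flag, not an existing Hadoop field


def write_segment(records, compress):
    """Serialize records into one segment, optionally compressing it."""
    raw = b"".join(records)
    return Segment(zlib.compress(raw) if compress else raw, compress)


def read_segment(segment):
    """Option 2: the reader handles mixed segments by checking the flag."""
    return zlib.decompress(segment.data) if segment.compressed else segment.data


def rewrite_compressed(segment):
    """Option 1: rewrite an uncompressed segment so the reader always sees compressed data."""
    if segment.compressed:
        return segment
    return Segment(zlib.compress(segment.data), True)


# An uncompressed lone spill next to an ordinary compressed output.
spill = write_segment([b"a\t1\n", b"b\t2\n"], compress=False)
final = write_segment([b"c\t3\n"], compress=True)

merged = b"".join(read_segment(s) for s in (spill, final))
rewritten = rewrite_compressed(spill)
```

Option 1 costs an extra pass over the data in the rare case; option 2 costs a 
flag on every segment and a branch in the reader.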

> MapTask and ReduceTask should only compress/decompress the final map output 
> file
> --------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-2212
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2212
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>    Affects Versions: 0.23.0
>            Reporter: Scott Chen
>            Assignee: Scott Chen
>             Fix For: 0.23.0
>
>
> Currently, if we set mapred.map.output.compression.codec:
> 1. MapTask will compress every spill, then decompress every spill, merge, and 
> compress the final map output file.
> 2. ReduceTask will decompress, merge, and compress every map output file, and 
> repeat the compression/decompression on every pass.
> This causes all the data to be compressed/decompressed many times.
> The reason we need mapred.map.output.compression.codec is to reduce network 
> traffic.
> We should not compress/decompress the data again and again during merge sort.
> We should only compress the final map output file that will be transmitted 
> over the network.
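
To make the map-side part of the description concrete, the sketch below 
counts codec calls under the current and the proposed behavior. It is a toy 
model, not MapTask: the CountingCodec class and both map_side_* functions are 
hypothetical names, zlib stands in for the configured codec, and the merge is 
a plain in-memory sort. The reduce side is not modeled.

```python
import zlib


class CountingCodec:
    """zlib wrapper that counts compress/decompress calls (illustrative only)."""

    def __init__(self):
        self.compressions = 0
        self.decompressions = 0

    def compress(self, data):
        self.compressions += 1
        return zlib.compress(data)

    def decompress(self, data):
        self.decompressions += 1
        return zlib.decompress(data)


def merge(runs):
    """Merge runs of newline-terminated records into one sorted blob."""
    records = [r for run in runs for r in run.splitlines(keepends=True)]
    return b"".join(sorted(records))


def map_side_current(spills, codec):
    # Every spill is compressed on write and decompressed again for the merge.
    on_disk = [codec.compress(merge([s])) for s in spills]
    merged = merge([codec.decompress(blob) for blob in on_disk])
    return codec.compress(merged)


def map_side_proposed(spills, codec):
    # Spills stay uncompressed; only the final output is compressed.
    on_disk = [merge([s]) for s in spills]
    return codec.compress(merge(on_disk))


spills = [b"b\t2\na\t1\n", b"d\t4\nc\t3\n", b"e\t5\n"]

current = CountingCodec()
out_current = map_side_current(spills, current)

proposed = CountingCodec()
out_proposed = map_side_proposed(spills, proposed)
```

With three spills, the first variant makes four compress calls and three 
decompress calls, while the second makes a single compress call; both produce 
the same final output.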

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
