[
https://issues.apache.org/jira/browse/MAPREDUCE-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12969099#action_12969099
]
Scott Chen commented on MAPREDUCE-2212:
---------------------------------------
In our case, the bottleneck is usually CPU or network.
I like Joydeep's idea. It would be nice to have two separate codec options:
one for the intermediate compression (for disk IO) and one for the final
output compression (for network traffic).
> MapTask and ReduceTask should only compress/decompress the final map output
> file
> --------------------------------------------------------------------------------
>
> Key: MAPREDUCE-2212
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2212
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: task
> Affects Versions: 0.23.0
> Reporter: Scott Chen
> Assignee: Scott Chen
> Fix For: 0.23.0
>
>
> Currently, if we set mapred.map.output.compression.codec:
> 1. MapTask will compress every spill, then decompress every spill, merge
> them, and compress the final map output file.
> 2. ReduceTask will decompress, merge, and compress every map output file,
> and repeat the compression/decompression on every merge pass.
> This causes all the data to be compressed and decompressed many times.
> The reason we need mapred.map.output.compression.codec is for network traffic.
> We should not compress/decompress the data again and again during merge sort.
> We should only compress the final map output file that will be transmitted
> over the network.
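
To make the map-side cost in the description concrete, here is a minimal, self-contained sketch that counts codec calls for the two strategies. It is not Hadoop code: the class name SpillCompressionSketch and its methods are hypothetical, GZIP from java.util.zip stands in for whatever codec is configured, plain concatenation stands in for the real sorted merge, and a single merge pass is assumed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration only; none of these names exist in Hadoop.
public class SpillCompressionSketch {
    static int compressCalls = 0;
    static int decompressCalls = 0;

    static void resetCounters() {
        compressCalls = 0;
        decompressCalls = 0;
    }

    // GZIP is an assumption standing in for the configured codec.
    static byte[] compress(byte[] raw) throws IOException {
        compressCalls++;
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(raw);
        }
        return buf.toByteArray();
    }

    static byte[] decompress(byte[] packed) throws IOException {
        decompressCalls++;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(packed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gz.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        }
        return out.toByteArray();
    }

    // Stand-in for the merge step: the real merge is a sorted merge.
    static byte[] merge(List<byte[]> parts) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] p : parts) {
            out.write(p, 0, p.length);
        }
        return out.toByteArray();
    }

    // Behavior described in the issue: every spill is compressed when written,
    // decompressed when merged, and the merged output is compressed again.
    static byte[] currentBehavior(List<byte[]> spills) throws IOException {
        List<byte[]> onDisk = new ArrayList<>();
        for (byte[] s : spills) {
            onDisk.add(compress(s));
        }
        List<byte[]> readBack = new ArrayList<>();
        for (byte[] c : onDisk) {
            readBack.add(decompress(c));
        }
        return compress(merge(readBack));
    }

    // Proposed behavior: spills stay uncompressed; only the final file is compressed.
    static byte[] proposedBehavior(List<byte[]> spills) throws IOException {
        return compress(merge(spills));
    }

    public static void main(String[] args) throws IOException {
        List<byte[]> spills = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            spills.add(("spill-" + i + "\n").getBytes(StandardCharsets.UTF_8));
        }
        resetCounters();
        currentBehavior(spills);
        System.out.println("current:  compress=" + compressCalls + " decompress=" + decompressCalls);
        resetCounters();
        proposedBehavior(spills);
        System.out.println("proposed: compress=" + compressCalls + " decompress=" + decompressCalls);
    }
}
```

Under these assumptions, S spills cost S + 1 compressions and S decompressions with the current behavior, versus a single compression with the proposed one. The reduce side is not modeled here.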
--
This message is automatically generated by JIRA.