[ https://issues.apache.org/jira/browse/HIVE-759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-759:
------------------------------

    Attachment: hive-759-2009-08-18.patch

Attached a new patch that integrates Zheng's suggestions (Thanks Zheng!).
However, it does not change the codec conf to use DefaultCodec. Using lzo as the 
default conf value can ease users' burden, since lzo is a better fit here 
(http://blog.oskarsson.nu/2009/03/hadoop-feat-lzo-save-disk-space-and.html). 
Also, even if we used DefaultCodec or GzipCodec as the default, a user could 
still specify lzo as the intermediate data's codec, and the same error would 
happen if lzo is not loaded. So in order to avoid the job failing at its last 
step, we still need to check for that.
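
For illustration, a minimal sketch of the kind of check mentioned above (the 
class and method names are hypothetical and the fallback behavior is an 
assumption; this is not code from the patch):

{code}
// Illustrative sketch only -- not taken from hive-759-2009-08-18.patch.
// Idea: verify that the codec named in hive.intermediate.compression.codec can
// actually be loaded, and fall back to DefaultCodec so the job does not fail
// at its last step when lzo is missing from the classpath.
import org.apache.hadoop.conf.Configuration;

public class IntermediateCodecCheck {

  // Hypothetical helper: returns the configured codec class name if it is
  // loadable, otherwise the DefaultCodec fallback.
  public static String resolveIntermediateCodec(Configuration conf) {
    String codecName = conf.get(
        "hive.intermediate.compression.codec",
        "org.apache.hadoop.io.compress.LzoCodec");
    try {
      conf.getClassByName(codecName);   // e.g. LzoCodec may not be installed
      return codecName;
    } catch (ClassNotFoundException e) {
      // codec not loadable: fall back instead of failing at the last step
      return "org.apache.hadoop.io.compress.DefaultCodec";
    }
  }
}
{code}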

> add hive.intermediate.compression.codec option
> ----------------------------------------------
>
>                 Key: HIVE-759
>                 URL: https://issues.apache.org/jira/browse/HIVE-759
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Zheng Shao
>            Assignee: He Yongqiang
>         Attachments: hive-759-2009-08-17.patch, hive-759-2009-08-18.patch
>
>
> Hive uses the jobconf compression codec for all map-reduce jobs. This 
> includes both mapred.map.output.compression.codec and 
> mapred.output.compression.codec.
> In some cases, we want to distinguish between the codec used for intermediate 
> map-reduce jobs (which produce intermediate data between jobs) and the final 
> map-reduce jobs (which produce data stored in tables).
> For intermediate data, lzo might be a better fit because it's much faster; 
> for final data, gzip might be a better fit because it saves disk space.
> We should introduce two new options:
> {code}
> hive.intermediate.compression.codec=org.apache.hadoop.io.compress.LzoCodec
> hive.intermediate.compression.type=BLOCK
> {code}
> And use these 2 options to override the mapred.output.compression.* in the 
> FileSinkOperator that produces intermediate data.
> Note that it's possible that a single map-reduce job may have 2 
> FileSinkOperators: one produces intermediate data, and one produces final 
> data. So we need to add a flag to fileSinkDesc for that (a rough sketch of 
> the override follows below).
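
For illustration only, a rough sketch of the override described in the quoted 
description; the class, the method name, and the isIntermediate parameter are 
hypothetical stand-ins (only the hive.intermediate.* and 
mapred.output.compression.* option names come from the description):

{code}
// Illustrative sketch only -- not the actual FileSinkOperator change.
// When a file sink produces intermediate data (the flag proposed for
// fileSinkDesc), override the mapred.output.compression.* settings from the
// hive.intermediate.* options; final-data sinks keep the job-level settings.
import org.apache.hadoop.mapred.JobConf;

public class IntermediateCompressionOverride {

  public static void configureOutputCompression(JobConf job, boolean isIntermediate) {
    if (!isIntermediate) {
      return; // final data keeps the job-level compression codec settings
    }
    job.setBoolean("mapred.output.compress", true);
    job.set("mapred.output.compression.codec",
        job.get("hive.intermediate.compression.codec",
                "org.apache.hadoop.io.compress.LzoCodec"));
    job.set("mapred.output.compression.type",
        job.get("hive.intermediate.compression.type", "BLOCK"));
  }
}
{code}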

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
