[ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897005#action_12897005
 ] 

Yan Zhou commented on PIG-1501:
-------------------------------

By default, compression is *not* applied to the intermediate data, which 
matches the existing behavior.

RCFile is only slightly better than TFile in terms of compression ratio. In 
terms of performance, the difference is within background noise. Stitching 
costs should be minimal. Actually, the full "projection" is the biggest 
advantage of RCFile over other columnar storage such as Zebra. I was 
surprised to see that the compression improvement over TFile is marginal. The 
only cause I can think of is that the compression ratio is too sensitive to 
the data to pre-determine or even pre-estimate.

LZO is licensed under the GPL, but it appears that the Hadoop installation 
has it, at least on my test cluster.

> need to investigate the impact of compression on pig performance
> ----------------------------------------------------------------
>
>                 Key: PIG-1501
>                 URL: https://issues.apache.org/jira/browse/PIG-1501
>             Project: Pig
>          Issue Type: Test
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: compress_perf_data.txt, compress_perf_data_2.txt
>
>
> We would like to understand how compressing map results as well as 
> reducer output in a chain of MR jobs impacts performance. We can use PigMix 
> queries for this investigation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.