[
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yan Zhou updated PIG-1501:
--------------------------
Release Note:
This feature saves HDFS space used to store the intermediate data generated by
Pig and can improve query execution speed. In general, the more intermediate
data a query generates, the greater the storage and speed benefits.
There are no backward compatibility issues as a result of this feature.
Two Java properties control the behavior:
pig.tmpfilecompression, which defaults to false, determines whether temporary
files are compressed. If it is true, then
pig.tmpfilecompression.codec specifies which compression codec to use.
Currently, Pig accepts only "gz" and "lzo" as values. Since LZO is under the
GPL license, Hadoop may need to be configured to use the LZO codec. Please
refer to http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ for details.
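As a minimal sketch of how the two properties fit together, the shell helper below prints the JVM flags for a given codec and rejects anything other than the two values named above. The function name tmp_compression_flags is hypothetical and is not part of Pig; only the property names and the accepted values come from this note.

```shell
#!/bin/sh
# Hypothetical helper (not part of Pig): print the -D flags that enable
# temporary-file compression for the given codec.
tmp_compression_flags() {
    codec="$1"
    case "$codec" in
        gz|lzo)
            # Both properties are needed: the first switches compression on,
            # the second selects the codec.
            printf '%s %s\n' \
                "-Dpig.tmpfilecompression=true" \
                "-Dpig.tmpfilecompression.codec=$codec"
            ;;
        *)
            echo "unsupported codec: $codec (expected gz or lzo)" >&2
            return 1
            ;;
    esac
}

# Prints: -Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=gz
tmp_compression_flags gz
```

Checking the codec before launching is a convenience for the caller in this sketch; it does not describe how Pig itself handles an unsupported value.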
An example is the following "test.pig" script:
register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using
org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
as (user, action, timespent:long, query_term, ip_addr, timestamp,
estimated_revenue, page_info, page_links);
B1 = filter A by timespent == 4;
B = load '/user/pig/tests/data/pigmix/queryterm' as (query_term);
C = join B1 by query_term, B by query_term using 'skewed' parallel 300;
D = distinct C parallel 300;
store D into 'output.lzo';
which is launched as follows:
java -cp /grid/0/gs/conf/current:/grid/0/jars/pig.jar \
-Djava.library.path=/grid/0/gs/hadoop/current/lib/native/Linux-i386-32 \
-Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=lzo \
org.apache.pig.Main ./test.pig
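To try the gz codec instead, only the codec property needs to change. The sketch below assembles the same launch command with the paths from the example held in variables and prints it rather than running it, so it can be inspected first. The variable names (PIG_CLASSPATH, NATIVE_LIB, CODEC, PIG_CMD) are illustrative and are not defined by Pig.

```shell
#!/bin/sh
# Paths copied from the example above; adjust them for your cluster.
PIG_CLASSPATH="/grid/0/gs/conf/current:/grid/0/jars/pig.jar"
NATIVE_LIB="/grid/0/gs/hadoop/current/lib/native/Linux-i386-32"

# Illustrative variable: "gz" or "lzo". Defaults to gz if unset.
CODEC="${CODEC:-gz}"

# Assemble the command as a single line. java.library.path is kept
# unchanged from the original command.
PIG_CMD="java -cp $PIG_CLASSPATH -Djava.library.path=$NATIVE_LIB -Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=$CODEC org.apache.pig.Main ./test.pig"

# Dry run: print the command instead of executing it.
echo "$PIG_CMD"
```

Running the script with CODEC=lzo in the environment prints the original command from the example.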
Yan Zhou added a comment - 26/Aug/10 11:14 AM
This feature will save HDFS space used to store the intermediate data used by
PIG and potentially improve query execution speed. In general, the more
intermediate data generated, the more storage and speedup benefits.
There are no backward compatibility issues as a result of this feature.
An example is the following "test.pig" script:
register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using
org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
as (user, action, timespent:long, query_term, ip_addr, timestamp,
estimated_revenue, page_info, page_links);
B1 = filter A by timespent == 4;
B = load '/user/pig/tests/data/pigmix/queryterm' as (query_term);
C = join B1 by query_term, B by query_term using 'skewed' parallel 300;
D = distinct C parallel 300;
store D into 'output.lzo';
which is launched as follows:
java -cp /grid/0/gs/conf/current:/grid/0/jars/pig.jar \
-Djava.library.path=/grid/0/gs/hadoop/current/lib/native/Linux-i386-32 \
-Dpig.tmpfilecompression=true -Dpig.tmpfilecompression.codec=lzo \
org.apache.pig.Main ./test.pig
> need to investigate the impact of compression on pig performance
> ----------------------------------------------------------------
>
> Key: PIG-1501
> URL: https://issues.apache.org/jira/browse/PIG-1501
> Project: Pig
> Issue Type: Test
> Reporter: Olga Natkovich
> Assignee: Yan Zhou
> Fix For: 0.8.0
>
> Attachments: compress_perf_data.txt, compress_perf_data_2.txt,
> PIG-1501.patch, PIG-1501.patch, PIG-1501.patch
>
>
> We would like to understand how compressing map results as well as reducer
> output in a chain of MR jobs impacts performance. We can use PigMix
> queries for this investigation.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.