[
https://issues.apache.org/jira/browse/PIG-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791485#comment-14791485
]
Daniel Dai edited comment on PIG-4679 at 9/17/15 2:39 AM:
----------------------------------------------------------
Patch committed to trunk. Thanks Rohini for review!
was (Author: daijy):
Patch committed to trunk. Thanks Thejas for review!
> Performance degradation due to InputSizeReducerEstimator since PIG-3754
> -----------------------------------------------------------------------
>
> Key: PIG-4679
> URL: https://issues.apache.org/jira/browse/PIG-4679
> Project: Pig
> Issue Type: Bug
> Components: impl
> Reporter: Daniel Dai
> Assignee: Daniel Dai
> Fix For: 0.16.0
>
> Attachments: PIG-4679-0.patch, PIG-4679-1.patch
>
>
> On encountering a non-HDFS location in the input (for example, a JOIN
> involving both HBase tables and intermediate temp files), the Pig 0.14
> ReducerEstimator returns a total input size of -1 (unknown), whereas
> Pig 0.12.1 returned the sum of the temp file sizes as the total size.
> Because -1 is returned as the input size, Pig ends up using only one reducer
> for the job.
> STEPS TO REPRODUCE:
> 1. Create an HBase table with enough data, using the PerformanceEvaluation
> tool to generate it:
> {code:java}
> hbase org.apache.hadoop.hbase.PerformanceEvaluation --presplit=20 \
>   --rows=1000000 sequentialWrite 10
> {code}
> 2. Dump the table data into a file that we can then use in a Pig JOIN.
> The following Pig script generates the data file:
> {code:java}
> $ pig
> A = LOAD 'hbase://TestTable' USING
> org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:data', '-loadKey') AS
> (row_key: chararray, data: chararray);
> STORE A INTO 'hdfs:///tmp/re_test/test_table_data' USING PigStorage('|');
> {code}
> 3. Check the file size to make sure that it is more than 1,000,000,000 bytes,
> which is Pig's default bytes-per-reducer setting:
> {code:java}
> $ hdfs dfs -count hdfs:///tmp/re_test/test_table_data
> QA: 1 41 10280000000 hdfs:///tmp/re_test/test_table_data
> PROD: 1 57 10280000000 hdfs:///tmp/re_test/test_table_data
> {code}
> 4. Run a Pig script that joins the HBase table with the data file. QA and
> PROD will use different numbers of reducers: QA (176243) should run 1 reducer
> and PROD (176258) should run 11 reducers (10,280,000,000 / 1,000,000,000,
> rounded up):
> {code:java}
> $ pig
> A = LOAD 'hbase://TestTable' USING
> org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:data', '-loadKey') AS
> (row_key: chararray, data: chararray);
> B = LOAD 'hdfs:///tmp/re_test/test_table_data' USING PigStorage('|') AS
> (row_key: chararray, data: chararray);
> C = JOIN A BY row_key, B BY row_key;
> STORE C INTO 'hdfs:///tmp/re_test/test_table_data_join' USING PigStorage('|');
> {code}
> Pig 0.12.1 ran 11 reducers; Pig 0.13+ runs only 1 reducer.
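
The difference described above can be illustrated with a small standalone sketch. It is not Pig's actual InputSizeReducerEstimator code: the class and method names are hypothetical, the HBase input is simply modelled as a size of -1 (unknown), and the 1,000,000,000 bytes-per-reducer value is taken from step 3 of the report. It only contrasts the two ways of totalling input sizes and shows how each leads to the reducer counts reported.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names come from the Pig code base.
class ReducerEstimateSketch {
    static final long UNKNOWN = -1L;                      // size of a non-HDFS input
    static final long BYTES_PER_REDUCER = 1_000_000_000L; // default cited in the report

    // Behaviour the report attributes to Pig 0.12.1:
    // skip inputs of unknown size and sum the rest.
    static long totalSkippingUnknown(List<Long> sizes) {
        long total = 0;
        for (long size : sizes) {
            if (size != UNKNOWN) {
                total += size;
            }
        }
        return total;
    }

    // Behaviour the report attributes to Pig 0.14:
    // a single unknown input makes the whole total unknown.
    static long totalUnknownIfAnyUnknown(List<Long> sizes) {
        long total = 0;
        for (long size : sizes) {
            if (size == UNKNOWN) {
                return UNKNOWN;
            }
            total += size;
        }
        return total;
    }

    // Ceiling division by bytes-per-reducer, with at least one reducer.
    // An unknown total is assumed to fall back to a single reducer, as observed.
    static int reducersFor(long totalBytes) {
        if (totalBytes == UNKNOWN) {
            return 1;
        }
        return (int) Math.max(1, (totalBytes + BYTES_PER_REDUCER - 1) / BYTES_PER_REDUCER);
    }

    public static void main(String[] args) {
        // One HBase input (unknown size) joined with the 10,280,000,000-byte file.
        List<Long> inputs = Arrays.asList(UNKNOWN, 10_280_000_000L);
        System.out.println(reducersFor(totalSkippingUnknown(inputs)));     // prints 11
        System.out.println(reducersFor(totalUnknownIfAnyUnknown(inputs))); // prints 1
    }
}
```

With the file size from step 3, the first policy gives 11 reducers and the second gives 1, matching the Pig 0.12.1 and Pig 0.13+ results above.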