[
https://issues.apache.org/jira/browse/PIG-4512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516745#comment-14516745
]
Ángel Álvarez commented on PIG-4512:
------------------------------------
I've sorted the data as Daniel suggested, and this is what I've got:
             T1     T2     T3     T4     Average
HCatLoader   48134  46217  55369  54358  51019.5 ms
OrcStorage   44290  49200  49984  50767  48560.25 ms
PigStorage   19307  24092  20952  24774  22281.25 ms
OrcStorage improves on HCatLoader by only about 2.5 seconds on average.
Curiously, PigStorage is the clear winner by far. Splitting the
file before importing it into Hive, however, does not seem to have any
significant influence.
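For reference, the averages above can be re-derived from the raw timings with a small shell sketch. The `average` helper is an illustrative name, not part of the original test setup; the numbers are copied from the table.

```shell
# Hypothetical helper: print the mean of its arguments (values in ms).
average() {
  printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.2f\n", sum / NR }'
}

# Timings T1..T4 copied from the table above.
hcat=$(average 48134 46217 55369 54358)
orc=$(average 44290 49200 49984 50767)
pig=$(average 19307 24092 20952 24774)

echo "HCatLoader $hcat ms, OrcStorage $orc ms, PigStorage $pig ms"
```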
On the other hand, predicate pushdown is enabled in Hive by default
(https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties):
hive.optimize.ppd
Default Value: true
Added In: Hive 0.4.0
Whether to enable predicate pushdown (PPD).
So, if I try to do more or less the same operation in Hive:
export HADOOP_OPTS="-Dhive.execution.engine=tez"
hive -e "select uri,count(*) from nasadata_orc where uri=='test' group by uri;"
The one-row result is obtained in only 14048.25 ms (on average). Does this
mean my test in Pig is not using predicate pushdown?
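One way such per-run timings could be collected is sketched below. This is only an assumption about the measurement method, not necessarily how the numbers above were obtained: it assumes GNU `date` (for the `%N` nanosecond format), and `time_ms` is a hypothetical helper name.

```shell
# Hypothetical helper: run a command and print its wall-clock time in ms.
# Assumes GNU date, which supports %N (nanoseconds).
time_ms() {
  local start end
  start=$(date +%s%N)
  "$@" > /dev/null 2>&1
  end=$(date +%s%N)
  echo $(( (end - start) / 1000000 ))
}

# Example (left commented out, since it needs a Hive installation):
# for run in 1 2 3 4; do
#   time_ms hive -e "select uri,count(*) from nasadata_orc where uri=='test' group by uri;"
# done
```

The same helper could wrap the `pig` invocation, so that both engines are timed in the same way.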
> No performance improvement using OrcStorage
> -------------------------------------------
>
> Key: PIG-4512
> URL: https://issues.apache.org/jira/browse/PIG-4512
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.14.0
> Environment: Hortonworks 2.2, Pig 0.14.0, Hive 0.14.0, Tez
> Reporter: Ángel Álvarez
> Priority: Minor
>
> I've been doing some tests with Pig & Hive, trying to gain some performance
> using the OrcStorage class and its "Predicate Push Down" loader. I've
> followed these steps:
> 1. Download a dataset
> ftp://ita.ee.lbl.gov/traces/NASA_access_log_Aug95.gz
> 2. Create a new larger file by copying the same original file multiple times.
> cat NASA_access_log_Aug95 NASA_access_log_Aug95 ... > NASA
> 3. Add a new line in the data file
> echo 'slppp6.intermind.net - - [01/Aug/1995:00:00:11 -0400] "GET test HTTP/1.0" 200 9202' >> NASA
> and split the file into different parts
> split -l 1000000 NASA NASA.
> 4. Create the ORC table in Hive
> DROP TABLE nasadata_txt;
> DROP TABLE nasadata_orc;
> CREATE TABLE nasadata_txt(ip VARCHAR(50), user_identifier VARCHAR(50),
> user_id VARCHAR(50),date_time VARCHAR(50),zone VARCHAR(10),method
> VARCHAR(5),uri VARCHAR(200),version VARCHAR(10),status DECIMAL(3,0),size
> DECIMAL(10,0)) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS
> TEXTFILE;
> CREATE TABLE nasadata_orc(ip VARCHAR(50), user_identifier VARCHAR(50),
> user_id VARCHAR(50),date_time VARCHAR(50),zone VARCHAR(10),method
> VARCHAR(5),uri VARCHAR(200),version VARCHAR(10),status DECIMAL(3,0),size
> DECIMAL(10,0)) STORED AS ORC;
> -- Load into Text table
> LOAD DATA LOCAL INPATH 'NASA.*' INTO TABLE nasadata_txt;
> -- Copy to ORC table
> INSERT OVERWRITE TABLE nasadata_orc SELECT * FROM nasadata_txt;
> 5. Execute this pig script
> rmf /tmp/pruebaPPD;
> A = LOAD '/apps/hive/warehouse/nasadata_orc' using OrcStorage() as
> (ip,user_identifier,user_id,date_time,zone,method,uri,version,status,size);
> A = foreach A generate ip,uri,status;
> A = filter A by uri == 'test';
> A = group A by uri;
> A = foreach A generate group,COUNT(*);
> store A into '/tmp/pruebaPPD' using PigStorage(';');
> 6. Execute the previous script, replacing OrcStorage with
> org.apache.hive.hcatalog.pig.HCatLoader.
> I can't see any difference in performance between using OrcStorage and
> HCatLoader. Is there anything wrong in what I'm doing? Do I have to set any
> property?