[jira] [Commented] (HIVE-4014) Hive+RCFile is not doing column pruning and reading much more data than necessary
[ https://issues.apache.org/jira/browse/HIVE-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13631853#comment-13631853 ]

Ashutosh Chauhan commented on HIVE-4014:
----------------------------------------

Thanks [~tamastarjanyi] for your investigation with different version combinations.

> Hive+RCFile is not doing column pruning and reading much more data than
> necessary
> ----------------------------------------------------------------------
>
>          Key: HIVE-4014
>          URL: https://issues.apache.org/jira/browse/HIVE-4014
>      Project: Hive
>   Issue Type: Bug
>     Reporter: Vinod Kumar Vavilapalli
>     Assignee: Vinod Kumar Vavilapalli
>      Fix For: 0.10.0
>
> With even simple projection queries, I see that the HDFS bytes read counter
> doesn't show any reduction in the amount of data read.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[ https://issues.apache.org/jira/browse/HIVE-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13621063#comment-13621063 ]

Tamas Tarjanyi commented on HIVE-4014:
--------------------------------------

Meanwhile I have tested the Cloudera 4.2.0 release with the so-called parcel "CDH 4.2.0-1.cdh4.2.0.p0.10" (https://ccp.cloudera.com/display/SUPPORT/CDH4+Downloadable+Tarballs). Unfortunately it still seems to have this problem. (I did not do very rigorous testing this time.) As noted above, Apache Hive 0.10.0 seems to be fine.

If you set this to "Cannot Reproduce", then I am afraid this bug will not be visible to Hadoop users and some unlucky one will use the wrong version. I am also afraid that distributors will not realize this is a problem. Can you set it to Fixed in 0.10.0 instead? Then no more work is needed: the issue can be closed since trunk is fine, but if somebody checks the bug list this will still be visible as a possible problem. Thanks.
[ https://issues.apache.org/jira/browse/HIVE-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13621026#comment-13621026 ]

Ashutosh Chauhan commented on HIVE-4014:
----------------------------------------

bq. Okay, I cannot reproduce this on trunk, though I was consistently hitting this on hive-0.10. I'll try hive-0.10 again to be sure some other patch fixed this.

Can this then be resolved as "Cannot Reproduce"?
[ https://issues.apache.org/jira/browse/HIVE-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13591350#comment-13591350 ]

Tamas Tarjanyi commented on HIVE-4014:
--------------------------------------

Hi Vinod,

As I have stated above:
BAD: CDH4.1.3, which is using hadoop-2.0.0+556 / hive-0.9.0+158
GOOD: hadoop 1.0.3 / hive 0.10.0 (apache download)
GOOD: hadoop 1.0.4 / hive 0.10.0 (apache download)

Meanwhile I have also tried Hortonworks Data Platform 1.2.1:
GOOD: HDP1.2.1, Apache Hadoop 1.1.2-rc3 / Apache Hive 0.10.0

So it now seems that the issue is in hive-0.9. My real problem is that both Hortonworks and Cloudera bundle hive-0.9 with hadoop-2.x.y, and I wanted to use hadoop-2.x.y with hive-0.10.x, not hadoop-1.
[ https://issues.apache.org/jira/browse/HIVE-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13591058#comment-13591058 ]

Vinod Kumar Vavilapalli commented on HIVE-4014:
-----------------------------------------------

Okay, I cannot reproduce this on trunk, though I was consistently hitting this on hive-0.10. I'll try hive-0.10 again to be sure some other patch fixed this.

[~tamastarjanyi], what version are you using?
[ https://issues.apache.org/jira/browse/HIVE-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13589226#comment-13589226 ]

Lianhui Wang commented on HIVE-4014:
------------------------------------

Hi Tamas, thank you very much; you are right. I also think RCFile.Reader is not very efficient in how the column ids to be read are transferred to it.
[ https://issues.apache.org/jira/browse/HIVE-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13587776#comment-13587776 ]

Tamas Tarjanyi commented on HIVE-4014:
--------------------------------------

I could not check the code, but my experience is different. See my results.

set dfs.replication=1;
set mapred.submit.replication=1;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions.pernode=1000;

-- # The command below will create an artificial csv input file of ~1.5Gb with 10m rows:
-- # i=0 ; while [ $i -lt 10000000 ] ; do echo "$i,$((i%5)),12345678901234567890123456789012345678901234567890,abcdefghjkabcdefghjkabcdefghjkabcdefghjkabcdefghjk,00,,," ; ((i=i+1)) ; done >/mnt/hadoop/RCTEST.csv

-- Load this data into a base table.
DROP TABLE RCTEST_CSV;
CREATE TABLE RCTEST_CSV (
  id BIGINT,
  counter STRING,
  value2 STRING,
  value3 STRING,
  value4 STRING,
  value5 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '\n';
LOAD DATA LOCAL INPATH "/mnt/hadoop/RCTEST.csv" INTO TABLE RCTEST_CSV;

-- Multiply it into a "BAD" table.
-- This will require ~30Gb altogether and will have 200m rows.
DROP TABLE RCTEST_BAD;
CREATE TABLE RCTEST_BAD STORED AS RCFILE AS SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;
INSERT INTO TABLE RCTEST_BAD SELECT * FROM RCTEST_CSV;

-- Create a table to simulate the expected column pruning.
-- The table below will require ~1.5Gb and will also have 200m rows.
DROP TABLE RCTEST_GOOD;
CREATE TABLE RCTEST_GOOD STORED AS RCFILE AS SELECT id, counter FROM RCTEST_BAD;

-- Let's start measuring performance.
-- For this test my cluster had 1 node with 2 disks in a stripe. (The rest of the nodes were decommissioned to eliminate parallel reads across nodes.)
-- Read speed is roughly 110Mb/sec/disk, which means 220Mb/sec max in total.
--
-- Please execute the OS command below to eliminate OS caching every time before you execute any of the SQL commands below!!!
--
-- sync && echo 3 >/proc/sys/vm/drop_caches
--
-- 1.) Should be slow. This is the reference value for a full read.
SELECT count(*) FROM RCTEST_BAD;
-- RESULT: 158 sec
--
-- 2.) Should be faster than 1 because of column pruning.
SELECT count(id) FROM RCTEST_BAD;
-- RESULT: 156 sec
-- Actually I believe this is proof that column pruning is not working. But let's go further.
--
-- 3.) Roughly should be like 2 but faster than 1.
SELECT count(counter) FROM RCTEST_BAD;
-- RESULT: 159 sec
--
-- Let's see how it works with the simulated pruning.
--
-- 4.) Should be faster than 1. In theory roughly the same as 2 with proper pruning.
SELECT count(*) FROM RCTEST_GOOD;
-- RESULT: 29 sec
--
-- 5.) Should be the same as 4 since pruning does not work here either.
SELECT count(id) FROM RCTEST_GOOD;
-- RESULT: 31 sec
--
-- 6.) Should be the same as 4 since pruning does not work here either.
SELECT count(counter) FROM RCTEST_GOOD;
-- RESULT: 31 sec
--
-- SHORT TEST WITH A GOOD VERSION
--
-- Now let's see 1. and 2. on a well-working Apache 1.0.4 release.
-- The machine is my desktop notebook, with one hard disk only.
-- The amount of data is ONLY 10m rows now.
--
-- 2.1.) Should be slow. This is the reference value for a full read.
SELECT count(*) FROM RCTEST_BAD;
-- RESULT: 80 sec
--
-- 2.2.) Should be faster than 2.1 because of column pruning.
SELECT count(id) FROM RCTEST_BAD;
-- RESULT: 47 sec
-- YES! That is what I assumed. More data would mean more performance gain, especially on better hardware.
--
-- 2.5.) Should be faster than 2.1. In theory roughly the same as 2.2 with proper pruning.
SELECT count(id) FROM RCTEST_GOOD;
-- RESULT: 44 sec
-- HERE WE ARE! AS EXPECTED.
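The reported table sizes above can be sanity-checked with a rough back-of-envelope estimate derived from the CSV generator's row layout. This is an illustrative sketch under stated assumptions (raw text field widths, no RCFile framing or compression overhead), not a measurement; it should only land in the right ballpark.

```python
# Rough size estimate for the test tables above, from the generator's
# row layout: id, counter (1 digit), two 50-char values, "00", an
# empty value5, seven commas and a newline. Assumes plain text widths
# with no RCFile framing overhead, so results are ballpark only.

def row_bytes(id_digits: int) -> int:
    fields = id_digits + 1 + 50 + 50 + 2 + 0
    separators = 7 + 1  # seven commas and one newline
    return fields + separators

rows_bad = 200_000_000           # 20 copies of the 10m-row CSV
full_row = row_bytes(7)          # ids average ~7 digits over 10m rows
pruned_row = 7 + 1 + 1           # id + counter + one separator

full_gb = rows_bad * full_row / 1e9
pruned_gb = rows_bad * pruned_row / 1e9

print(f"full table   ~{full_gb:.0f} GB")   # reported above: ~30 Gb
print(f"pruned table ~{pruned_gb:.0f} GB") # reported above: ~1.5 Gb
```

Both estimates come out within a factor of ~1.3 of the reported ~30Gb and ~1.5Gb figures, which is consistent with the ~5x speedup RCTEST_GOOD shows over RCTEST_BAD being almost entirely a matter of bytes read.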
[ https://issues.apache.org/jira/browse/HIVE-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13586701#comment-13586701 ]

Lianhui Wang commented on HIVE-4014:
------------------------------------

I do not think so; I looked at the code. HiveInputFormat's and CombineHiveInputFormat's getRecordReader() call pushProjectionsAndFilters(), which gets the needed columns from the TableScanOperator and sets their ids in hive.io.file.readcolumn.ids. RCFile.Reader then reads hive.io.file.readcolumn.ids to skip columns. Maybe the counter has some mistakes. If I am mistaken, please tell me. Thx.
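The handoff described in the comment above, where the planner writes the projected column ids into the job configuration and the reader consults them to skip columns, can be sketched roughly as follows. Only the property name hive.io.file.readcolumn.ids is taken from the comment; the parsing and skip logic here are simplified assumptions for illustration, not Hive's actual Java implementation.

```python
# Simplified sketch of projection pushdown via the job configuration.
# Only the property name hive.io.file.readcolumn.ids comes from the
# discussion above; everything else is an illustrative assumption.

READ_COLUMN_IDS_CONF = "hive.io.file.readcolumn.ids"

def parse_column_ids(conf_value: str) -> list[int]:
    """Parse the comma-separated id list the planner stores in the job
    configuration. An empty value means no projection was pushed down,
    i.e. the reader must materialize every column."""
    if not conf_value or not conf_value.strip():
        return []
    return [int(part) for part in conf_value.split(",")]

def should_skip_column(wanted_ids: list[int], column_id: int) -> bool:
    """A column's bytes can be skipped only when a projection exists
    and does not include this column."""
    return bool(wanted_ids) and column_id not in wanted_ids

# e.g. SELECT id, counter FROM rctest_bad would push "0,1" for a
# six-column table, letting the reader skip the bytes of columns 2-5.
conf = {READ_COLUMN_IDS_CONF: "0,1"}
wanted = parse_column_ids(conf[READ_COLUMN_IDS_CONF])
for col in range(6):
    print(f"column {col}: skip={should_skip_column(wanted, col)}")
```

If this chain works end to end, the HDFS bytes read counter should shrink with the projection, which is exactly what the benchmark in the later comment does not observe on the affected versions.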
[ https://issues.apache.org/jira/browse/HIVE-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13586425#comment-13586425 ]

Tamas Tarjanyi commented on HIVE-4014:
--------------------------------------

I cannot see affected versions here, but I found the same issue on CDH4.1.3, which is using hadoop-2.0.0+556 / hive-0.9.0+158. Then I downloaded and tested hadoop 1.0.3 / hive 0.10.0 and hadoop 1.0.4 / hive 0.10.0. Both work fine and pruning is effective in these cases.
[ https://issues.apache.org/jira/browse/HIVE-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13577216#comment-13577216 ]

Vinod Kumar Vavilapalli commented on HIVE-4014:
-----------------------------------------------

I already tracked it down; I will upload a patch soon.