[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13774088#comment-13774088 ]

Hudson commented on HIVE-4113:

FAILURE: Integrated in Hive-trunk-hadoop2 #450 (See [https://builds.apache.org/job/Hive-trunk-hadoop2/450/])
HIVE-4113 : Optimize select count(1) with RCFile and Orc (Brock Noland and Yin Huai via Ashutosh Chauhan) (hashutosh: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1525322)
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/conf/hive-default.xml.template
* /hive/trunk/contrib/src/test/results/clientpositive/serde_typedbytes.q.out
* /hive/trunk/contrib/src/test/results/clientpositive/serde_typedbytes2.q.out
* /hive/trunk/contrib/src/test/results/clientpositive/serde_typedbytes3.q.out
* /hive/trunk/contrib/src/test/results/clientpositive/serde_typedbytes5.q.out
* /hive/trunk/contrib/src/test/results/clientpositive/udf_row_sequence.q.out
* /hive/trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HiveHBaseTableInputFormat.java
* /hive/trunk/hbase-handler/src/test/results/positive/hbase_queries.q.out
* /hive/trunk/hbase-handler/src/test/results/positive/hbase_single_sourced_multi_insert.q.out
* /hive/trunk/hcatalog/core/src/main/java/org/apache/hive/hcatalog/mapreduce/HCatBaseInputFormat.java
* /hive/trunk/hcatalog/core/src/test/java/org/apache/hive/hcatalog/mapreduce/TestHCatPartitioned.java
* /hive/trunk/hcatalog/hcatalog-pig-adapter/src/test/java/org/apache/hive/hcatalog/pig/TestHCatLoader.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/Driver.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FetchTask.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/SMBMapJoinOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/MapredLocalTask.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/BucketizedHiveInputFormat.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFile.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFileRecordReader.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/GenMRSkewJoinProcessor.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/MetadataOnlyOptimizer.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/io/PerformTestRCFileAndSeqFile.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/io/TestRCFile.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestInputOutputFormat.java
* /hive/trunk/ql/src/test/queries/clientpositive/binary_table_colserde.q
* /hive/trunk/ql/src/test/results/clientpositive/auto_join0.q.out
* /hive/trunk/ql/src/test/results/clientpositive/auto_join15.q.out
* /hive/trunk/ql/src/test/results/clientpositive/auto_join18.q.out
* /hive/trunk/ql/src/test/results/clientpositive/auto_join18_multi_distinct.q.out
* /hive/trunk/ql/src/test/results/clientpositive/auto_join20.q.out
* /hive/trunk/ql/src/test/results/clientpositive/auto_join27.q.out
* /hive/trunk/ql/src/test/results/clientpositive/auto_join30.q.out
* /hive/trunk/ql/src/test/results/clientpositive/auto_join31.q.out
* /hive/trunk/ql/src/test/results/clientpositive/auto_join_reordering_values.q.out
* /hive/trunk/ql/src/test/results/clientpositive/auto_smb_mapjoin_14.q.out
* /hive/trunk/ql/src/test/results/clientpositive/auto_sortmerge_join_10.q.out
* /hive/trunk/ql/src/test/results/clientpositive/auto_sortmerge_join_6.q.out
* /hive/trunk/ql/src/test/results/clientpositive/auto_sortmerge_join_9.q.out
* /hive/trunk/ql/src/test/results/clientpositive/binary_output_format.q.out
* /hive/trunk/ql/src/test/results/clientpositive/binary_table_colserde.q.out
* /hive/trunk/ql/src/test/results/clientpositive/bucket5.q.out
* /hive/trunk/ql/src/test/results/clientpositive/bucketizedhiveinputformat.q.out
* /hive/trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out
* /hive/trunk/ql/src/test/results/clientpositive/bucketmapjoin2.q.out
* /hive/trunk/ql/src/test/results/clientpositive/bucketmapjoin3.q.out
* /hive/trunk/ql/src/test/results/clientpositive/bucketmapjoin4.q.out
* /hive/trunk/ql/src/test/results/clientpositive/bucketmapjoin5.q.out
* /hive/trunk/ql/src/test/results/clientpositive/bucketmapjoin_negative.q.out
[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773747#comment-13773747 ]

Hive QA commented on HIVE-4113:

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12604367/HIVE-4113.8.patch

{color:red}ERROR:{color} -1 due to 272 failed/errored test(s), 3131 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join30
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join31
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join_reordering_values
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_smb_mapjoin_14
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_sortmerge_join_10
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_sortmerge_join_6
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_sortmerge_join_9
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_binary_output_format
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin3
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin4
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin5
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin_negative
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin_negative2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_case_sensitivity
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cast1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cluster
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_column_access_stats
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ctas_colname
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ctas_uses_database_location
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_decimal_udf
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_filter_join_breaktask
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby10
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby11
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby1_limit
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby1_map_skew
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby3
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby3_map_skew
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby4
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby5
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby6
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby6_map_skew
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby7_map
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby7_map_skew
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby7_noskew
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby7_noskew_multi_single_reducer
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby8
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby8_map
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby8_map_skew
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby8_noskew
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby9
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_complex_types
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_complex_types_multi_single_reducer
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_cube1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_distinct_samekey
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_grouping_sets2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_grouping_sets3
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_grouping_sets4
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_grouping_sets5
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_multi_insert_common_distinct
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_multi_single_reducer
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_position
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_rollup1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_sort_1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_sort_3
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_sort_5
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_sort_7
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_sort_skew_1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_index_auto
[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773920#comment-13773920 ]

Yin Huai commented on HIVE-4113:

Thanks Ashutosh for updating golden files :)

Optimize select count(1) with RCFile and Orc

Key: HIVE-4113
URL: https://issues.apache.org/jira/browse/HIVE-4113
Project: Hive
Issue Type: Bug
Components: File Formats
Reporter: Gopal V
Assignee: Yin Huai
Fix For: 0.12.0
Attachments: HIVE-4113-0.patch, HIVE-4113.10.patch, HIVE-4113.1.patch, HIVE-4113.2.patch, HIVE-4113.3.patch, HIVE-4113.4.patch, HIVE-4113.5.patch, HIVE-4113.6.patch, HIVE-4113.7.patch, HIVE-4113.8.patch, HIVE-4113.9.patch, HIVE-4113.patch, HIVE-4113.patch

select count(1) loads up every column of every row when used with RCFile.

select count(1) from store_sales_10_rc gives
{code}
Job 0: Map: 5 Reduce: 1 Cumulative CPU: 31.73 sec HDFS Read: 234914410 HDFS Write: 8 SUCCESS
{code}
Whereas select count(ss_sold_date_sk) from store_sales_10_rc; reads far less
{code}
Job 0: Map: 5 Reduce: 1 Cumulative CPU: 29.75 sec HDFS Read: 28145994 HDFS Write: 8 SUCCESS
{code}
which is about 12% of the data read by the COUNT(1).

This was tracked down to the following code in RCFile.java
{code}
} else {
  // TODO: if no column name is specified e.g, in select count(1) from tt;
  // skip all columns, this should be distinguished from the case:
  // select * from tt;
  for (int i = 0; i < skippedColIDs.length; i++) {
    skippedColIDs[i] = false;
  }
{code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
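The distinction that the TODO in the quoted RCFile.java snippet asks for can be sketched roughly as follows. This is only an illustration of the idea, not the code from any HIVE-4113 patch: the class `ColumnSkipSketch`, the method `skippedColumns`, and the explicit `readAllColumns` flag are all hypothetical names chosen for this example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch; not the actual RCFile.java code. */
final class ColumnSkipSketch {

  /**
   * Computes which columns a reader may skip.
   *
   * @param totalColumns    number of columns stored in the file
   * @param readAllColumns  assumed flag: true for "select *" style reads
   * @param neededColumnIds column ids the query actually references;
   *                        empty for "select count(1)"
   */
  static boolean[] skippedColumns(int totalColumns, boolean readAllColumns,
      List<Integer> neededColumnIds) {
    boolean[] skipped = new boolean[totalColumns]; // all false: skip nothing
    if (readAllColumns) {
      return skipped;
    }
    // Start by skipping everything, then re-enable the referenced columns.
    Arrays.fill(skipped, true);
    for (int id : neededColumnIds) {
      if (id >= 0 && id < totalColumns) {
        skipped[id] = false;
      }
    }
    return skipped;
  }

  public static void main(String[] args) {
    List<Integer> none = Collections.emptyList();
    // select * from tt: nothing is skipped.
    System.out.println(Arrays.toString(skippedColumns(3, true, none)));
    // select count(1) from tt: every column can be skipped.
    System.out.println(Arrays.toString(skippedColumns(3, false, none)));
    // select count(c1) from tt: only column 1 is read.
    System.out.println(Arrays.toString(skippedColumns(3, false, Arrays.asList(1))));
  }
}
```

With an explicit flag, an empty list of needed columns no longer has to double as "read everything", which is the ambiguity the TODO describes.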
[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773974#comment-13773974 ]

Hive QA commented on HIVE-4113:

{color:green}Overall{color}: +1 all checks pass

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12604421/HIVE-4113.11.patch

{color:green}SUCCESS:{color} +1 3143 tests passed

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/856/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/856/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.
[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13772658#comment-13772658 ]

Ashutosh Chauhan commented on HIVE-4113:

It's not necessary. I thought it would make the code easier to read, but if it's too intrusive, we can leave that for now. +1
[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773008#comment-13773008 ]

Yin Huai commented on HIVE-4113:

My previous patch deleted some imports.
[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773024#comment-13773024 ]

Ashutosh Chauhan commented on HIVE-4113:

Even after fixing the import statements, most of the auto_join* tests are failing.
[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773037#comment-13773037 ]

Yin Huai commented on HIVE-4113:

The problem is that neededColumns is not set for the TableScanOperators used to load intermediate data (the output of a previous stage). I forgot about this issue before.
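A reader-side null check for this situation might look like the sketch below. It assumes, purely for illustration, that an unset column list shows up as `null` and that falling back to every column is the safe choice for intermediate data. `NeededColumnsFallback` and `effectiveColumnIds` are hypothetical names, not Hive APIs.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a null-check fallback; not Hive code. */
final class NeededColumnsFallback {

  /**
   * Returns the column ids to read. A null list is treated as
   * "the planner never set this", so all columns are read.
   * An empty list is kept as-is and means "no column values needed".
   */
  static List<Integer> effectiveColumnIds(List<Integer> neededColumnIds, int schemaWidth) {
    if (neededColumnIds != null) {
      return neededColumnIds;
    }
    List<Integer> all = new ArrayList<>();
    for (int i = 0; i < schemaWidth; i++) {
      all.add(i);
    }
    return all;
  }
}
```

The weakness of this approach is that every consumer has to remember the special meaning of `null`.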
[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773113#comment-13773113 ]

Yin Huai commented on HIVE-4113:

The .6 patch still has some problems; please ignore it.
[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773134#comment-13773134 ]

Ashutosh Chauhan commented on HIVE-4113:

It seems that, instead of a null check, a more elegant fix is for TableScanOp to always contain the list of columns it wants to read, even for subsequent MR jobs. I am not sure how easy that is, though; it will probably require changes in the query planner. Yin, can you take a quick look at whether it is easy to fix it that way? If it turns out to be quite a bit of work, we can do that in a follow-up too.
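The alternative suggested here, where the planner always records the column list, could be sketched like this. The types `ScanDesc` and `TempScanSketch` and the method `forIntermediateData` are hypothetical stand-ins, and the sketch assumes that a scan over intermediate data needs every column the previous stage wrote.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of a planner-side fix; not Hive code. */
final class TempScanSketch {

  /** Minimal stand-in for a table scan descriptor. */
  static final class ScanDesc {
    private List<Integer> neededColumnIds = Collections.emptyList();

    List<Integer> getNeededColumnIds() {
      return neededColumnIds;
    }

    void setNeededColumnIds(List<Integer> ids) {
      // Defensive copy so later planner changes cannot alter the descriptor.
      neededColumnIds = Collections.unmodifiableList(new ArrayList<>(ids));
    }
  }

  /**
   * Builds the descriptor for a scan over intermediate data and
   * explicitly lists every column, so readers never see an unset list.
   */
  static ScanDesc forIntermediateData(int columnCount) {
    List<Integer> all = new ArrayList<>();
    for (int i = 0; i < columnCount; i++) {
      all.add(i);
    }
    ScanDesc desc = new ScanDesc();
    desc.setNeededColumnIds(all);
    return desc;
  }
}
```

With this shape the reader needs no special case: the list it receives always states exactly what to read.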
[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773444#comment-13773444 ]

Ashutosh Chauhan commented on HIVE-4113:

Thanks Yin for making changes. There seems to be another bug lurking in there, which makes the following queries fail. They were failing with the previous version of the patch and are failing with the latest one as well:
{noformat}
$ ant test -Dtestcase=TestCliDriver -Dmodule=ql -Dqfile=binary_table_bincolserde.q,binary_table_colserde.q,combine3.q,concatenate_inherit_table_location.q,correlationoptimizer5.q,cp_mj_rc.q,create_merge_compressed.q,ctas_hadoop20.q,date_serde.q,decimal_serde.q,drop_database_removes_partition_dirs.q,drop_table_removes_partition_dirs.q
{noformat}
[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773425#comment-13773425 ]

Yin Huai commented on HIVE-4113:

[~ashutoshc] It is pretty easy. I just spent some time refactoring the code to make sure we assign needed columns to all dummy TableScanOperators. However, it seems that in trunk, if we need an individual MR job for UNION ALL, we always create a dummy TableScanOperator with a dummy conf, whereas in other cases a dummy TableScanOperator does not have a conf. I think adding the conf is better because those dummy TableScanOperators can then be seen in the results of EXPLAIN, so a bug such as HIVE-4927 can be found more easily. The one-time cost of adding a dummy conf to a dummy TableScanOperator is that we may need to update lots of golden files.
[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773624#comment-13773624 ]

Yin Huai commented on HIVE-4113:

{code}
input = genConversionSelectOperator(dest, qb, input, table_desc, dpCtx);
inputRR = opParseCtx.get(input).getRowResolver();

ArrayList<ColumnInfo> vecCol = new ArrayList<ColumnInfo>();

try {
  StructObjectInspector rowObjectInspector = (StructObjectInspector) table_desc
      .getDeserializer().getObjectInspector();
  List<? extends StructField> fields = rowObjectInspector
      .getAllStructFieldRefs();
  for (int i = 0; i < fields.size(); i++) {
    vecCol.add(new ColumnInfo(fields.get(i).getFieldName(), TypeInfoUtils
        .getTypeInfoFromObjectInspector(fields.get(i)
        .getFieldObjectInspector()), "", false));
  }
} catch (Exception e) {
  throw new SemanticException(e.getMessage(), e);
}

RowSchema fsRS = new RowSchema(vecCol);
{code}
This is the relevant part of the code. Basically, we get the Deserializer and then construct a RowSchema for a FileSinkOperator. But I do not think we should call getDeserializer in SemanticAnalyzer; I need to fix it. Those SerDe classes also have some problems.
[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773677#comment-13773677 ]
Yin Huai commented on HIVE-4113:

If a test contains a query that is evaluated by multiple MR jobs, the corresponding golden file will need to be updated, because all dummy TableScanOperators will now appear in the query plans. If we do not want to make these updates right now, we can change GenMapRedUtils.createTemporaryTableScanOperator(RowSchema) to use
{code}
TableScanOperator tableScanOp =
    (TableScanOperator) OperatorFactory.get(TableScanDesc.class, rowSchema);
{code}
instead of
{code}
TableScanOperator tableScanOp =
    (TableScanOperator) OperatorFactory.get(new TableScanDesc(), rowSchema);
{code}
[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773695#comment-13773695 ]
Hive QA commented on HIVE-4113:

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12604308/HIVE-4113.7.patch

{color:red}ERROR:{color} -1 due to 356 failed/errored test(s), 3131 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join30
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join31
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join_reordering_values
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_smb_mapjoin_14
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_sortmerge_join_10
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_sortmerge_join_6
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_sortmerge_join_9
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_binary_output_format
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_binary_table_bincolserde
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_binary_table_colserde
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin3
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin4
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin5
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin_negative
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin_negative2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_case_sensitivity
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cast1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cluster
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_column_access_stats
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_columnarserde_create_shortcut
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_combine3
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_concatenate_inherit_table_location
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer10
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer11
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer12
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer13
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer14
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer15
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer3
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer4
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer5
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer6
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer7
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer8
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer9
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cp_mj_rc
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_create_merge_compressed
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ctas_colname
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ctas_hadoop20
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ctas_uses_database_location
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_date_serde
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_decimal_serde
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_decimal_udf
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_drop_database_removes_partition_dirs
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_drop_table_removes_partition_dirs
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_escape2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_filter_join_breaktask
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby10
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby11
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby1_limit
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby1_map_skew
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby3
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby3_map_skew
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby4
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby5
[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773717#comment-13773717 ]
Ashutosh Chauhan commented on HIVE-4113:

I think TS does make sense there. So, let's bite the bullet and update the golden files.
[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13771652#comment-13771652 ]
Ashutosh Chauhan commented on HIVE-4113:

Thanks, [~yhuai], for taking this one up. It's a known existing problem that predicate pushdown doesn't happen for HCatalog today. I would say that if it is getting burdensome, we can tackle that in a separate jira.

I am fine with removing the flag for column pruning. It's been around for a long time (HIVE-279) and I haven't come across a case where a user has run into a problem with it.

I didn't get your comment about READ_ALL_COLUMNS_DEFAULT. If we set it to true, does that imply that this optimization will be off by default? That seems like a bad choice. In HCatInputFormat, we can probably set the config such that it always selects all columns for now. That way Hive will still get the benefit of the optimization and HCatalog will continue with what it is doing today.
[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13771847#comment-13771847 ]
Yin Huai commented on HIVE-4113:

READ_ALL_COLUMNS and READ_ALL_COLUMNS_DEFAULT were created mainly for HCat, because I think it is a burden on users if they have to be aware of ColumnProjectionUtils and use it every time. So, through HCat, if users do not use ColumnProjectionUtils to set the needed columns, we will read all columns. If we set READ_ALL_COLUMNS_DEFAULT=false, no column will be read if a user does not use ColumnProjectionUtils.

In Hive, if we get rid of the column pruning flag, the list of neededColumnIDs in TS will never be null. Thus, in Hive, we will always set READ_ALL_COLUMNS to false (the .2 patch has an issue with this; I will fix it later).

In summary, in Hive, neededColumnIDs in TS is the only way to tell an underlying record reader what to read. If neededColumnIDs is an empty list, we know that no column is needed. Otherwise, we read the columns specified in neededColumnIDs (if we have select * in a sub-query, neededColumnIDs should be populated to include all columns).

In HCat, a user who wants to use the MapReduce interface has two ways to say which columns are needed.
1) The user does nothing. In this case, we will read all columns.
2) The user calls utility functions in ColumnProjectionUtils (e.g. setReadColumnIDs) to specify the needed columns. In this case, READ_ALL_COLUMNS will be set to false and we only read the columns specified in READ_COLUMN_IDS_CONF_STR.

I hope what I am proposing makes sense. I welcome any suggestions :)
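The semantics proposed in the comment above can be sketched as follows. This is a simplified model under stated assumptions, not the real ColumnProjectionUtils: the job configuration is modeled as a plain Map, the key string for the column-ID list is assumed, and the class and method names (ReadColumnsSketch, setReadColumnIds, columnsToRead) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical model of the proposal; a Map stands in for the job configuration.
final class ReadColumnsSketch {
    static final String READ_ALL_COLUMNS = "hive.io.file.read.all.columns";
    static final String READ_COLUMN_IDS = "hive.io.file.readcolumn.ids"; // assumed key string
    static final boolean READ_ALL_COLUMNS_DEFAULT = true;

    /** Naming the needed columns also switches "read all columns" off. */
    static void setReadColumnIds(Map<String, String> conf, List<Integer> ids) {
        StringBuilder sb = new StringBuilder();
        for (int id : ids) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(id);
        }
        conf.put(READ_COLUMN_IDS, sb.toString());
        conf.put(READ_ALL_COLUMNS, "false");
    }

    /** Optional.empty() means "read every column"; otherwise the explicit ID list. */
    static Optional<List<Integer>> columnsToRead(Map<String, String> conf) {
        String flag = conf.get(READ_ALL_COLUMNS);
        boolean readAll = (flag == null) ? READ_ALL_COLUMNS_DEFAULT : Boolean.parseBoolean(flag);
        if (readAll) {
            return Optional.empty();
        }
        List<Integer> ids = new ArrayList<>();
        String raw = conf.get(READ_COLUMN_IDS);
        if (raw != null && !raw.isEmpty()) {
            for (String part : raw.split(",")) {
                ids.add(Integer.parseInt(part.trim()));
            }
        }
        return Optional.of(ids);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(columnsToRead(conf));   // untouched conf: the HCat "do nothing" case
        setReadColumnIds(conf, Arrays.asList(2, 5));
        System.out.println(columnsToRead(conf));   // explicit projection
    }
}
```

With the default set to true, a caller that never touches the projection settings gets every column, while a caller that passes an empty list gets none, which is what a count(1) plan needs.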
[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13771905#comment-13771905 ]
Ashutosh Chauhan commented on HIVE-4113:

Sounds good to me. Go ahead and make the changes.
[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13772163#comment-13772163 ]
Ashutosh Chauhan commented on HIVE-4113:

[~yhuai] I left some comments on RB. But it seems you updated the patch in the meanwhile, so you may have already addressed some of them.
[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13772317#comment-13772317 ]
Yin Huai commented on HIVE-4113:

Please ignore those duplicated replies...
[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13772590#comment-13772590 ]
Yin Huai commented on HIVE-4113:

[~ashutoshc] Using LinkedHashSet as the type of neededColumns requires changes in lots of places. Since we always do the deduplication work in ColumnProjectionUtils.getReadColumnIDs(Configuration), is it necessary to make this replacement?
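The trade-off raised above, deduplicating once at read time instead of changing the collection type everywhere, can be illustrated with a small sketch. It is not the body of ColumnProjectionUtils.getReadColumnIDs; the class DedupSketch and the method parseDistinctIds are hypothetical, and a comma-separated string is assumed as the stored form of the ID list.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical sketch of order-preserving deduplication at read time.
final class DedupSketch {

    /** Parses a comma-separated ID list, dropping duplicates but keeping first-seen order. */
    static List<Integer> parseDistinctIds(String raw) {
        LinkedHashSet<Integer> seen = new LinkedHashSet<>();
        if (raw != null) {
            for (String part : raw.split(",")) {
                String token = part.trim();
                if (!token.isEmpty()) {
                    seen.add(Integer.parseInt(token)); // add() ignores values already present
                }
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        System.out.println(parseDistinctIds("3,1,3,2,1"));
    }
}
```

Callers can keep appending to an ordinary list, and duplicates are removed only at the single point where the list is consumed.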
[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13771243#comment-13771243 ]
Yin Huai commented on HIVE-4113:

I thought there was no flag for column pruning, so tableScan.getNeededColumnIDs() would never be null. But there is a flag (hive.optimize.cp), so when hive.optimize.cp=false, neededColumnIDs in TableScanOperator will not be set.

I am so sorry that I have blocked this jira for a long time. I think Brock's patch is good. I will just rebase it and also make a minor change to the comments in ColumnProjectionUtils.
[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13771307#comment-13771307 ]
Yin Huai commented on HIVE-4113:

[~brocknoland] I have one question. Why do we need ColumnProjectionUtils.setReadAllColumns(jobConf); in those hcat classes (e.g. InitializeInput)?
[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13771309#comment-13771309 ]
Brock Noland commented on HIVE-4113:

Remove it and see what happens? I don't remember exactly, but I think I put that in there because, if you don't specify anything, we now won't read any columns, while they were expecting all columns to be read.
[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13771370#comment-13771370 ]
Yin Huai commented on HIVE-4113:

[~brocknoland] I see. Thanks. I am not sure whether those changes will affect reading RCFile and ORC through HCat (that is, whether we will read unnecessary columns). Let me check.
[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13771530#comment-13771530 ]
Yin Huai commented on HIVE-4113:

Three issues:
# ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR is only used in Hive. HCatalog does not seem to set it, so it seems we cannot do predicate pushdown when accessing ORC through HCatalog.
# neededColumnIDs in TableScanOperator can be null when column pruning is disabled. In that case, it seems we can hit an NPE in ColumnAccessAnalyzer.analyzeColumnAccess. Also, when column pruning is disabled, we cannot do predicate pushdown in Hive, because neededColumnIDs will be null.
# With this change, we assume that an empty neededColumnIDs means that no column is needed. Selecting all columns can be represented either by ColumnProjectionUtils.READ_ALL_COLUMNS=true or by READ_COLUMN_IDS_CONF_STR listing all columns.

I will make two changes.
# Remove the flag for column pruning.
# Set READ_ALL_COLUMNS_DEFAULT to true. So, if users of HCatalog do not use ColumnProjectionUtils, we select all columns for them. If we used false for READ_ALL_COLUMNS_DEFAULT, users would have to use ColumnProjectionUtils; otherwise, no column would be selected.
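The second and third issues above come down to three different states of neededColumnIDs: null (pruning never ran), empty (no column needed), and populated. A minimal sketch that makes the three states explicit is given below; the enum and the class NeededColumnsSketch are hypothetical and do not appear in Hive.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; Hive itself does not define this enum.
final class NeededColumnsSketch {

    enum Projection {
        UNKNOWN, // null list: column pruning did not run, so the reader must assume all columns
        NONE,    // empty list: no column is needed (for example count(1))
        SOME     // explicit, non-empty list of column IDs
    }

    /** Classifies a neededColumnIDs value without dereferencing a possible null. */
    static Projection classify(List<Integer> neededColumnIDs) {
        if (neededColumnIDs == null) {
            return Projection.UNKNOWN;
        }
        return neededColumnIDs.isEmpty() ? Projection.NONE : Projection.SOME;
    }

    public static void main(String[] args) {
        System.out.println(classify(null));
        System.out.println(classify(Collections.<Integer>emptyList()));
        System.out.println(classify(Arrays.asList(0, 4)));
    }
}
```

Removing the pruning flag, as proposed, eliminates the UNKNOWN state inside Hive, which is why an empty list can then safely mean "read nothing".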
[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13769688#comment-13769688 ]
Ashutosh Chauhan commented on HIVE-4113:

In addition to what [~yhuai] suggested for RCFile, a similar enhancement exists for ORC as well: ORC stores stats (including counts) per stripe, which would allow us to do almost no IO. But I would say that those enhancements will likely require changes in the query processing code, so I will consider them out of scope for this jira. Let's get this one in and take up the enhancements in a follow-up.
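The follow-up idea mentioned above, answering count(1) from per-stripe metadata instead of scanning rows, can be sketched as below. The StripeMeta type and its numberOfRows field are hypothetical stand-ins for whatever the ORC reader exposes, so this only illustrates the shape of the optimization and is not a working ORC integration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; StripeMeta stands in for real ORC stripe metadata.
final class StripeCountSketch {

    /** Minimal per-stripe metadata: only the row count matters here. */
    static final class StripeMeta {
        final long numberOfRows;

        StripeMeta(long numberOfRows) {
            this.numberOfRows = numberOfRows;
        }
    }

    /** Sums the per-stripe row counts; no column data is touched. */
    static long countFromMetadata(List<StripeMeta> stripes) {
        long total = 0;
        for (StripeMeta stripe : stripes) {
            total += stripe.numberOfRows;
        }
        return total;
    }

    public static void main(String[] args) {
        List<StripeMeta> stripes = Arrays.asList(
                new StripeMeta(1000), new StripeMeta(2500), new StripeMeta(40));
        System.out.println(countFromMetadata(stripes));
    }
}
```

As the comment notes, wiring something like this into a query would need changes in the query processing code, since the aggregate would be answered before any rows reach the operators.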
[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13769693#comment-13769693 ] Brock Noland commented on HIVE-4113: Agreed. Unfortunately I won't have time to take this up in the next few days so if someone has time and would like to see this in soon I'd be more than willing to hand it off.
[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13769730#comment-13769730 ] Yin Huai commented on HIVE-4113: Let me take a look. It seems only a few minor changes are needed for Brock's patch. One thing I need to make sure of is whether we populate all columns in the list of needed columns for 'select * from'. If so, we will not need hive.io.file.read.all.columns.
[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13769740#comment-13769740 ] Ashutosh Chauhan commented on HIVE-4113: Thanks [~yhuai] for volunteering. Assigning it to you.
[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13769712#comment-13769712 ] Prasanth J commented on HIVE-4113: HIVE-4340 will expose ORC stats through reader interfaces which can be used for optimizing count(*).
[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13769716#comment-13769716 ] Prasanth J commented on HIVE-4113: Sorry. Please ignore that comment. Row count interface already exists in ORC reader. HIVE-4340 is not relevant for this JIRA.
[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13763425#comment-13763425 ] Ashutosh Chauhan commented on HIVE-4113: [~brocknoland] Are you still working on this? Looks like a useful optimization. If you can address [~yhuai]'s comments and rebase the patch, I will be happy to help review it.
[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13763506#comment-13763506 ] Brock Noland commented on HIVE-4113: Thanks [~yhuai] for reviewing and thanks Ashutosh for pinging me on this. I'll try and look at how out of date this patch is within the next week.
[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13709510#comment-13709510 ] Yin Huai commented on HIVE-4113: [~brocknoland] It seems that we use setNeededColumnIDs in TableScanOperator to set the needed columns in ColumnPrunerTableScanProc (in the class ColumnPrunerProcFactory), and that neededColumnIDs in TableScanOperator will never be null. If I am right, then for the code in HiveInputFormat shown below,
{code:java}
// push down projections
ArrayList<Integer> list = tableScan.getNeededColumnIDs();
if (list != null) {
  ColumnProjectionUtils.appendReadColumnIDs(jobConf, list);
} else {
  ColumnProjectionUtils.setReadAllColumns(jobConf);
}
{code}
setReadAllColumns will never be called.

Also, assuming we use RCFile, if we have 'select count(1)', we will skip all columns. It seems that we can still generate correct results because, from the key buffer, we will know recordsNumInValBuffer (the number of rows in a row group) and we will call 'next' recordsNumInValBuffer times. Is my understanding correct? If so, do you think we should add some comments explaining it where we set all elements of skippedColIDs to true?

I think that we can take advantage of recordsNumInValBuffer to make an improvement. Instead of calling 'next' recordsNumInValBuffer times, we can pass this number directly to GroupByOperator (I have not considered whether it is easy to implement). That would avoid a lot of unnecessary function calls. If we want to make this improvement, we can work on it in a separate jira.
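The improvement proposed in the comment above, using the per-row-group record count instead of one 'next' call per row, can be sketched as follows. This is only an illustration under stated assumptions, not Hive code: RowGroupCountSketch, countPerRow, and countPerGroup are hypothetical names, and the int array stands in for the recordsNumInValBuffer value read from each row group's key buffer. How such a number would actually reach GroupByOperator is left open, as it is in the comment.

```java
// Hypothetical illustration only; none of these names exist in Hive.
final class RowGroupCountSketch {

    /** Baseline: one increment per row, mirroring one 'next' call per record. */
    static long countPerRow(int[] recordsPerGroup) {
        long total = 0;
        for (int records : recordsPerGroup) {
            for (int i = 0; i < records; i++) {
                total++; // stands in for the per-row 'next' call
            }
        }
        return total;
    }

    /** Proposed: add each row group's record count in a single step. */
    static long countPerGroup(int[] recordsPerGroup) {
        long total = 0;
        for (int records : recordsPerGroup) {
            total += records; // one addition per row group, no per-row work
        }
        return total;
    }

    public static void main(String[] args) {
        int[] groups = {4, 0, 7}; // made-up record counts for three row groups
        System.out.println(countPerRow(groups) == countPerGroup(groups));
    }
}
```

Both methods return the same total; the second does work proportional to the number of row groups rather than the number of rows, which is the saving the comment is after.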
[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13709686#comment-13709686 ] Hive QA commented on HIVE-4113: {color:green}Overall:{color}: +1 all checks pass {color:green}SUCCESS:{color} +1 all tests passed Executing org.apache.hive.ptest.execution.CleanupPhase Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase
[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc
[ https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13684535#comment-13684535 ] Brock Noland commented on HIVE-4113: [~owen.omalley] would you have some time to look at attached patch? Thanks!