[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-09-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13774088#comment-13774088
 ] 

Hudson commented on HIVE-4113:
--

FAILURE: Integrated in Hive-trunk-hadoop2 #450 (See 
[https://builds.apache.org/job/Hive-trunk-hadoop2/450/])
HIVE-4113 : Optimize select count(1) with RCFile and Orc (Brock Noland and Yin 
Huai via Ashutosh Chauhan) (hashutosh: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1525322)
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/conf/hive-default.xml.template
* /hive/trunk/contrib/src/test/results/clientpositive/serde_typedbytes.q.out
* /hive/trunk/contrib/src/test/results/clientpositive/serde_typedbytes2.q.out
* /hive/trunk/contrib/src/test/results/clientpositive/serde_typedbytes3.q.out
* /hive/trunk/contrib/src/test/results/clientpositive/serde_typedbytes5.q.out
* /hive/trunk/contrib/src/test/results/clientpositive/udf_row_sequence.q.out
* /hive/trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HiveHBaseTableInputFormat.java
* /hive/trunk/hbase-handler/src/test/results/positive/hbase_queries.q.out
* /hive/trunk/hbase-handler/src/test/results/positive/hbase_single_sourced_multi_insert.q.out
* /hive/trunk/hcatalog/core/src/main/java/org/apache/hive/hcatalog/mapreduce/HCatBaseInputFormat.java
* /hive/trunk/hcatalog/core/src/test/java/org/apache/hive/hcatalog/mapreduce/TestHCatPartitioned.java
* /hive/trunk/hcatalog/hcatalog-pig-adapter/src/test/java/org/apache/hive/hcatalog/pig/TestHCatLoader.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/Driver.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FetchTask.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/SMBMapJoinOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/MapredLocalTask.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/BucketizedHiveInputFormat.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFile.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFileRecordReader.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/GenMRSkewJoinProcessor.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/MetadataOnlyOptimizer.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/io/PerformTestRCFileAndSeqFile.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/io/TestRCFile.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestInputOutputFormat.java
* /hive/trunk/ql/src/test/queries/clientpositive/binary_table_colserde.q
* /hive/trunk/ql/src/test/results/clientpositive/auto_join0.q.out
* /hive/trunk/ql/src/test/results/clientpositive/auto_join15.q.out
* /hive/trunk/ql/src/test/results/clientpositive/auto_join18.q.out
* /hive/trunk/ql/src/test/results/clientpositive/auto_join18_multi_distinct.q.out
* /hive/trunk/ql/src/test/results/clientpositive/auto_join20.q.out
* /hive/trunk/ql/src/test/results/clientpositive/auto_join27.q.out
* /hive/trunk/ql/src/test/results/clientpositive/auto_join30.q.out
* /hive/trunk/ql/src/test/results/clientpositive/auto_join31.q.out
* /hive/trunk/ql/src/test/results/clientpositive/auto_join_reordering_values.q.out
* /hive/trunk/ql/src/test/results/clientpositive/auto_smb_mapjoin_14.q.out
* /hive/trunk/ql/src/test/results/clientpositive/auto_sortmerge_join_10.q.out
* /hive/trunk/ql/src/test/results/clientpositive/auto_sortmerge_join_6.q.out
* /hive/trunk/ql/src/test/results/clientpositive/auto_sortmerge_join_9.q.out
* /hive/trunk/ql/src/test/results/clientpositive/binary_output_format.q.out
* /hive/trunk/ql/src/test/results/clientpositive/binary_table_colserde.q.out
* /hive/trunk/ql/src/test/results/clientpositive/bucket5.q.out
* /hive/trunk/ql/src/test/results/clientpositive/bucketizedhiveinputformat.q.out
* /hive/trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out
* /hive/trunk/ql/src/test/results/clientpositive/bucketmapjoin2.q.out
* /hive/trunk/ql/src/test/results/clientpositive/bucketmapjoin3.q.out
* /hive/trunk/ql/src/test/results/clientpositive/bucketmapjoin4.q.out
* /hive/trunk/ql/src/test/results/clientpositive/bucketmapjoin5.q.out
* /hive/trunk/ql/src/test/results/clientpositive/bucketmapjoin_negative.q.out
* 

[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-09-21 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773747#comment-13773747
 ] 

Hive QA commented on HIVE-4113:
---



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12604367/HIVE-4113.8.patch

{color:red}ERROR:{color} -1 due to 272 failed/errored test(s), 3131 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join30
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join31
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join_reordering_values
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_smb_mapjoin_14
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_sortmerge_join_10
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_sortmerge_join_6
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_sortmerge_join_9
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_binary_output_format
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin3
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin4
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin5
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin_negative
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin_negative2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_case_sensitivity
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cast1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cluster
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_column_access_stats
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ctas_colname
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ctas_uses_database_location
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_decimal_udf
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_filter_join_breaktask
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby10
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby11
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby1_limit
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby1_map_skew
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby3
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby3_map_skew
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby4
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby5
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby6
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby6_map_skew
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby7_map
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby7_map_skew
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby7_noskew
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby7_noskew_multi_single_reducer
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby8
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby8_map
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby8_map_skew
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby8_noskew
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby9
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_complex_types
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_complex_types_multi_single_reducer
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_cube1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_distinct_samekey
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_grouping_sets2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_grouping_sets3
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_grouping_sets4
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_grouping_sets5
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_multi_insert_common_distinct
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_multi_single_reducer
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_position
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_rollup1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_sort_1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_sort_3
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_sort_5
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_sort_7
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_sort_skew_1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_index_auto

[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-09-21 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773920#comment-13773920
 ] 

Yin Huai commented on HIVE-4113:


Thanks Ashutosh for updating golden files :)

 Optimize select count(1) with RCFile and Orc
 

 Key: HIVE-4113
 URL: https://issues.apache.org/jira/browse/HIVE-4113
 Project: Hive
  Issue Type: Bug
  Components: File Formats
Reporter: Gopal V
Assignee: Yin Huai
 Fix For: 0.12.0

 Attachments: HIVE-4113-0.patch, HIVE-4113.10.patch, 
 HIVE-4113.1.patch, HIVE-4113.2.patch, HIVE-4113.3.patch, HIVE-4113.4.patch, 
 HIVE-4113.5.patch, HIVE-4113.6.patch, HIVE-4113.7.patch, HIVE-4113.8.patch, 
 HIVE-4113.9.patch, HIVE-4113.patch, HIVE-4113.patch


 select count(1) loads up every column & every row when used with RCFile.
 select count(1) from store_sales_10_rc gives
 {code}
 Job 0: Map: 5  Reduce: 1   Cumulative CPU: 31.73 sec   HDFS Read: 234914410 HDFS Write: 8 SUCCESS
 {code}
 Whereas select count(ss_sold_date_sk) from store_sales_10_rc; reads far less
 {code}
 Job 0: Map: 5  Reduce: 1   Cumulative CPU: 29.75 sec   HDFS Read: 28145994 HDFS Write: 8 SUCCESS
 {code}
 Which is about 12% of the data size read by the COUNT(1).
 This was tracked down to the following code in RCFile.java
 {code}
   } else {
     // TODO: if no column name is specified e.g, in select count(1) from tt;
     // skip all columns, this should be distinguished from the case:
     // select * from tt;
     for (int i = 0; i < skippedColIDs.length; i++) {
       skippedColIDs[i] = false;
     }
 {code}
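
 A minimal sketch of the distinction the TODO above asks for, written as
 standalone Java rather than as Hive's actual RCFile code. The class name, the
 computeSkipped helper and the readAllColumns flag are assumptions made for
 illustration only. The idea is that "no column is needed" (select count(1))
 and "every column is needed" (select *) must be told apart explicitly instead
 of both falling through to "skip nothing".

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ColumnSkipSketch {

    // Hypothetical helper (not Hive's API): decide which columns a reader may skip.
    // readAllColumns == true models "select * from tt": skip nothing.
    // readAllColumns == false with no needed IDs models
    // "select count(1) from tt": skip every column.
    static boolean[] computeSkipped(int numColumns, List<Integer> neededIds,
                                    boolean readAllColumns) {
        boolean[] skipped = new boolean[numColumns]; // defaults to all false
        if (readAllColumns) {
            return skipped;
        }
        Arrays.fill(skipped, true);
        for (int id : neededIds) {
            if (id >= 0 && id < numColumns) {
                skipped[id] = false; // this column must be read
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        List<Integer> none = Collections.<Integer>emptyList();
        // count(1): no column is needed, so every column is skipped
        System.out.println(Arrays.toString(computeSkipped(3, none, false)));
        // prints [true, true, true]

        // select *: every column is read
        System.out.println(Arrays.toString(computeSkipped(3, none, true)));
        // prints [false, false, false]

        // count(col1): only column 1 is read
        System.out.println(Arrays.toString(computeSkipped(3, Arrays.asList(1), false)));
        // prints [true, false, true]
    }
}
```

 With a separate flag for "read everything", an empty list of needed columns
 can safely mean "read nothing but the row count".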

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-09-21 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773974#comment-13773974
 ] 

Hive QA commented on HIVE-4113:
---



{color:green}Overall{color}: +1 all checks pass

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12604421/HIVE-4113.11.patch

{color:green}SUCCESS:{color} +1 3143 tests passed

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/856/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/856/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.

 Optimize select count(1) with RCFile and Orc
 

 Key: HIVE-4113
 URL: https://issues.apache.org/jira/browse/HIVE-4113
 Project: Hive
  Issue Type: Bug
  Components: File Formats
Reporter: Gopal V
Assignee: Yin Huai
 Fix For: 0.12.0

 Attachments: HIVE-4113-0.patch, HIVE-4113.10.patch, 
 HIVE-4113.11.patch, HIVE-4113.1.patch, HIVE-4113.2.patch, HIVE-4113.3.patch, 
 HIVE-4113.4.patch, HIVE-4113.5.patch, HIVE-4113.6.patch, HIVE-4113.7.patch, 
 HIVE-4113.8.patch, HIVE-4113.9.patch, HIVE-4113.patch, HIVE-4113.patch




[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-09-20 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13772658#comment-13772658
 ] 

Ashutosh Chauhan commented on HIVE-4113:


It's not necessary. I thought it would make the code easier to read, but if it's 
too intrusive, we can leave that for now.
+1

 Optimize select count(1) with RCFile and Orc
 

 Key: HIVE-4113
 URL: https://issues.apache.org/jira/browse/HIVE-4113
 Project: Hive
  Issue Type: Bug
  Components: File Formats
Reporter: Gopal V
Assignee: Yin Huai
 Fix For: 0.12.0

 Attachments: HIVE-4113-0.patch, HIVE-4113.1.patch, HIVE-4113.2.patch, 
 HIVE-4113.3.patch, HIVE-4113.4.patch, HIVE-4113.patch, HIVE-4113.patch




[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-09-20 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773008#comment-13773008
 ] 

Yin Huai commented on HIVE-4113:


My previous patch deleted some imports.

 Optimize select count(1) with RCFile and Orc
 

 Key: HIVE-4113
 URL: https://issues.apache.org/jira/browse/HIVE-4113
 Project: Hive
  Issue Type: Bug
  Components: File Formats
Reporter: Gopal V
Assignee: Yin Huai
 Fix For: 0.12.0

 Attachments: HIVE-4113-0.patch, HIVE-4113.1.patch, HIVE-4113.2.patch, 
 HIVE-4113.3.patch, HIVE-4113.4.patch, HIVE-4113.patch, HIVE-4113.patch




[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-09-20 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773024#comment-13773024
 ] 

Ashutosh Chauhan commented on HIVE-4113:


Even after fixing the import statements, most of the auto_join* tests are failing.

 Optimize select count(1) with RCFile and Orc
 

 Key: HIVE-4113
 URL: https://issues.apache.org/jira/browse/HIVE-4113
 Project: Hive
  Issue Type: Bug
  Components: File Formats
Reporter: Gopal V
Assignee: Yin Huai
 Fix For: 0.12.0

 Attachments: HIVE-4113-0.patch, HIVE-4113.1.patch, HIVE-4113.2.patch, 
 HIVE-4113.3.patch, HIVE-4113.4.patch, HIVE-4113.patch, HIVE-4113.patch




[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-09-20 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773037#comment-13773037
 ] 

Yin Huai commented on HIVE-4113:


The problem is that neededColumns is not set for those TableScanOperators used 
to load intermediate data (the output of the previous stage)... I forgot about 
this issue before...
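
One way to picture the null-check workaround is the sketch below. It is
standalone Java, not Hive code: the class name and the columnsToRead helper are
hypothetical, and the only assumption carried over from the comment above is
that a scan over intermediate data may arrive with no needed-column list at
all. In that case the reader falls back to reading every column, while an
explicitly empty list still means "read no columns".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class NeededColumnsFallback {

    // Hypothetical resolver (not Hive's API).
    // null  : the planner never set the needed columns, so read everything.
    // empty : no column is needed at all (e.g. count(1)).
    static List<Integer> columnsToRead(List<Integer> neededColumnIds, int numColumns) {
        if (neededColumnIds == null) {
            List<Integer> all = new ArrayList<Integer>();
            for (int i = 0; i < numColumns; i++) {
                all.add(i);
            }
            return all;
        }
        return neededColumnIds;
    }

    public static void main(String[] args) {
        // Needed columns never set: fall back to all columns
        System.out.println(columnsToRead(null, 3));
        // prints [0, 1, 2]

        // Needed columns explicitly empty: read nothing
        System.out.println(columnsToRead(Collections.<Integer>emptyList(), 3));
        // prints []

        // Needed columns set: read exactly those
        System.out.println(columnsToRead(Arrays.asList(2), 3));
        // prints [2]
    }
}
```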

 Optimize select count(1) with RCFile and Orc
 

 Key: HIVE-4113
 URL: https://issues.apache.org/jira/browse/HIVE-4113
 Project: Hive
  Issue Type: Bug
  Components: File Formats
Reporter: Gopal V
Assignee: Yin Huai
 Fix For: 0.12.0

 Attachments: HIVE-4113-0.patch, HIVE-4113.1.patch, HIVE-4113.2.patch, 
 HIVE-4113.3.patch, HIVE-4113.4.patch, HIVE-4113.5.patch, HIVE-4113.patch, 
 HIVE-4113.patch




[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-09-20 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773113#comment-13773113
 ] 

Yin Huai commented on HIVE-4113:


The .6 patch still has some problems... please ignore it.

 Optimize select count(1) with RCFile and Orc
 

 Key: HIVE-4113
 URL: https://issues.apache.org/jira/browse/HIVE-4113
 Project: Hive
  Issue Type: Bug
  Components: File Formats
Reporter: Gopal V
Assignee: Yin Huai
 Fix For: 0.12.0

 Attachments: HIVE-4113-0.patch, HIVE-4113.1.patch, HIVE-4113.2.patch, 
 HIVE-4113.3.patch, HIVE-4113.4.patch, HIVE-4113.5.patch, HIVE-4113.6.patch, 
 HIVE-4113.patch, HIVE-4113.patch




[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-09-20 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773134#comment-13773134
 ] 

Ashutosh Chauhan commented on HIVE-4113:


It seems that, instead of a null check, a more elegant fix is for TableScanOp to 
always contain the list of columns it wants to read, even for subsequent MR 
jobs. I'm not sure how easy that is, though; it will probably require changes 
in the query planner. Yin, can you take a quick look at whether it's easy to 
fix it that way? If it turns out to be quite a bit of work, we can do that in a 
follow-up too.

 Optimize select count(1) with RCFile and Orc
 

 Key: HIVE-4113
 URL: https://issues.apache.org/jira/browse/HIVE-4113
 Project: Hive
  Issue Type: Bug
  Components: File Formats
Reporter: Gopal V
Assignee: Yin Huai
 Fix For: 0.12.0

 Attachments: HIVE-4113-0.patch, HIVE-4113.1.patch, HIVE-4113.2.patch, 
 HIVE-4113.3.patch, HIVE-4113.4.patch, HIVE-4113.5.patch, HIVE-4113.6.patch, 
 HIVE-4113.patch, HIVE-4113.patch




[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-09-20 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773444#comment-13773444
 ] 

Ashutosh Chauhan commented on HIVE-4113:


Thanks Yin for making the changes. There seems to be another bug lurking in 
there, which makes the following queries fail. They were failing with the 
previous version of the patch and are failing with the latest one as well:
{noformat}
$ ant test -Dtestcase=TestCliDriver -Dmodule=ql 
-Dqfile=binary_table_bincolserde.q,binary_table_colserde.q,combine3.q,concatenate_inherit_table_location.q,correlationoptimizer5.q,cp_mj_rc.q,create_merge_compressed.q,ctas_hadoop20.q,date_serde.q,decimal_serde.q,drop_database_removes_partition_dirs.q,drop_table_removes_partition_dirs.q

{noformat}

 Optimize select count(1) with RCFile and Orc
 

 Key: HIVE-4113
 URL: https://issues.apache.org/jira/browse/HIVE-4113
 Project: Hive
  Issue Type: Bug
  Components: File Formats
Reporter: Gopal V
Assignee: Yin Huai
 Fix For: 0.12.0

 Attachments: HIVE-4113-0.patch, HIVE-4113.1.patch, HIVE-4113.2.patch, 
 HIVE-4113.3.patch, HIVE-4113.4.patch, HIVE-4113.5.patch, HIVE-4113.6.patch, 
 HIVE-4113.7.patch, HIVE-4113.patch, HIVE-4113.patch




[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-09-20 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773425#comment-13773425
 ] 

Yin Huai commented on HIVE-4113:


[~ashutoshc] It is pretty easy. I just spent some time refactoring the code 
to make sure we assign needed columns to all dummy TableScanOperators. However, 
it seems that in trunk, if we need an individual MR job for UNION ALL, we 
always create a dummy TableScanOperator with a dummy conf, but in other cases 
a dummy TableScanOperator does not have a conf. I think adding the conf is 
better because those dummy TableScanOperators can then be seen in the results 
of EXPLAIN, so a bug such as HIVE-4927 can be found more easily. The one-time 
cost of adding a dummy conf to a dummy TableScanOperator is that we may need to 
update lots of golden files...
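
As a rough illustration of "every dummy scan carries a conf with its needed
columns", here is a standalone Java sketch. It does not use Hive's classes:
ScanConf and confFor are hypothetical names, and the assumption that a scan
over intermediate data needs every column of the schema it reads is mine, not
something stated in the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DummyScanSketch {

    // Hypothetical stand-in for the conf of a dummy table scan (not Hive's TableScanDesc).
    static final class ScanConf {
        final List<Integer> neededColumnIds;
        final List<String> neededColumnNames;

        ScanConf(List<Integer> ids, List<String> names) {
            this.neededColumnIds = ids;
            this.neededColumnNames = names;
        }
    }

    // Assumption: a scan over intermediate data needs every column of its schema,
    // so the conf lists all of them and readers never see a missing (null) list.
    static ScanConf confFor(List<String> schemaColumns) {
        List<Integer> ids = new ArrayList<Integer>();
        for (int i = 0; i < schemaColumns.size(); i++) {
            ids.add(i);
        }
        return new ScanConf(ids, new ArrayList<String>(schemaColumns));
    }

    public static void main(String[] args) {
        ScanConf conf = confFor(Arrays.asList("_col0", "_col1"));
        System.out.println(conf.neededColumnIds);   // prints [0, 1]
        System.out.println(conf.neededColumnNames); // prints [_col0, _col1]
    }
}
```

Because the list is always populated, a reader no longer needs a null check to
decide whether to read all columns.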

 Optimize select count(1) with RCFile and Orc
 

 Key: HIVE-4113
 URL: https://issues.apache.org/jira/browse/HIVE-4113
 Project: Hive
  Issue Type: Bug
  Components: File Formats
Reporter: Gopal V
Assignee: Yin Huai
 Fix For: 0.12.0

 Attachments: HIVE-4113-0.patch, HIVE-4113.1.patch, HIVE-4113.2.patch, 
 HIVE-4113.3.patch, HIVE-4113.4.patch, HIVE-4113.5.patch, HIVE-4113.6.patch, 
 HIVE-4113.patch, HIVE-4113.patch




[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-09-20 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13773624#comment-13773624
 ] 

Yin Huai commented on HIVE-4113:


{code}
input = genConversionSelectOperator(dest, qb, input, table_desc, dpCtx);
inputRR = opParseCtx.get(input).getRowResolver();

ArrayList<ColumnInfo> vecCol = new ArrayList<ColumnInfo>();

try {
  StructObjectInspector rowObjectInspector = (StructObjectInspector) table_desc
      .getDeserializer().getObjectInspector();
  List<? extends StructField> fields = rowObjectInspector
      .getAllStructFieldRefs();
  for (int i = 0; i < fields.size(); i++) {
    vecCol.add(new ColumnInfo(fields.get(i).getFieldName(), TypeInfoUtils
        .getTypeInfoFromObjectInspector(fields.get(i)
        .getFieldObjectInspector()), "", false));
  }
} catch (Exception e) {
  throw new SemanticException(e.getMessage(), e);
}

RowSchema fsRS = new RowSchema(vecCol);
{code}

This is the relevant part of the code. Basically, we are trying to get the 
Deserializer and then construct a RowSchema for a FileSinkOperator... But I do 
not think we should call getDeserializer in SemanticAnalyzer... I need to fix 
it. Those SerDe classes also have some problems.

 Optimize select count(1) with RCFile and Orc
 

 Key: HIVE-4113
 URL: https://issues.apache.org/jira/browse/HIVE-4113
 Project: Hive
  Issue Type: Bug
  Components: File Formats
Reporter: Gopal V
Assignee: Yin Huai
 Fix For: 0.12.0

 Attachments: HIVE-4113-0.patch, HIVE-4113.1.patch, HIVE-4113.2.patch, 
 HIVE-4113.3.patch, HIVE-4113.4.patch, HIVE-4113.5.patch, HIVE-4113.6.patch, 
 HIVE-4113.7.patch, HIVE-4113.patch, HIVE-4113.patch


 select count(1) loads up every column & every row when used with RCFile.
 select count(1) from store_sales_10_rc gives
 {code}
 Job 0: Map: 5  Reduce: 1   Cumulative CPU: 31.73 sec   HDFS Read: 234914410 HDFS Write: 8 SUCCESS
 {code}
 Whereas, select count(ss_sold_date_sk) from store_sales_10_rc; reads far less
 {code}
 Job 0: Map: 5  Reduce: 1   Cumulative CPU: 29.75 sec   HDFS Read: 28145994 HDFS Write: 8 SUCCESS
 {code}
 Which is 11% of the data size read by the COUNT(1).
 This was tracked down to the following code in RCFile.java
 {code}
   } else {
     // TODO: if no column name is specified e.g, in select count(1) from tt;
     // skip all columns, this should be distinguished from the case:
     // select * from tt;
     for (int i = 0; i < skippedColIDs.length; i++) {
       skippedColIDs[i] = false;
     }
 {code}



[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-09-20 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13773677#comment-13773677
 ] 

Yin Huai commented on HIVE-4113:


If a test contains a query evaluated by multiple MR jobs, the corresponding 
golden file will need to be updated, because all dummy TableScanOperators will 
appear in the query plans. If we do not want this kind of update right now, we 
can change GenMapRedUtils.createTemporaryTableScanOperator(RowSchema) to use 
{code}
TableScanOperator tableScanOp = (TableScanOperator) 
OperatorFactory.get(TableScanDesc.class, rowSchema);
{code}
instead of 
{code}
TableScanOperator tableScanOp = (TableScanOperator) OperatorFactory.get(new 
TableScanDesc(), rowSchema);
{code}

 Optimize select count(1) with RCFile and Orc
 

 Key: HIVE-4113
 URL: https://issues.apache.org/jira/browse/HIVE-4113
 Project: Hive
  Issue Type: Bug
  Components: File Formats
Reporter: Gopal V
Assignee: Yin Huai
 Fix For: 0.12.0

 Attachments: HIVE-4113-0.patch, HIVE-4113.1.patch, HIVE-4113.2.patch, 
 HIVE-4113.3.patch, HIVE-4113.4.patch, HIVE-4113.5.patch, HIVE-4113.6.patch, 
 HIVE-4113.7.patch, HIVE-4113.8.patch, HIVE-4113.patch, HIVE-4113.patch




[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-09-20 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13773695#comment-13773695
 ] 

Hive QA commented on HIVE-4113:
---



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12604308/HIVE-4113.7.patch

{color:red}ERROR:{color} -1 due to 356 failed/errored test(s), 3131 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join30
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join31
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join_reordering_values
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_smb_mapjoin_14
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_sortmerge_join_10
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_sortmerge_join_6
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_sortmerge_join_9
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_binary_output_format
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_binary_table_bincolserde
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_binary_table_colserde
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin3
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin4
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin5
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin_negative
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_bucketmapjoin_negative2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_case_sensitivity
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cast1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cluster
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_column_access_stats
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_columnarserde_create_shortcut
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_combine3
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_concatenate_inherit_table_location
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer10
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer11
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer12
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer13
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer14
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer15
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer3
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer4
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer5
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer6
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer7
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer8
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_correlationoptimizer9
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cp_mj_rc
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_create_merge_compressed
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ctas_colname
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ctas_hadoop20
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ctas_uses_database_location
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_date_serde
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_decimal_serde
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_decimal_udf
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_drop_database_removes_partition_dirs
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_drop_table_removes_partition_dirs
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_escape2
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_filter_join_breaktask
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby1
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby10
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby11
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby1_limit
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby1_map_skew
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby3
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby3_map_skew
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby4
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby5

[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-09-20 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13773717#comment-13773717
 ] 

Ashutosh Chauhan commented on HIVE-4113:


I think TS does make sense there. So, let's bite the bullet and update the 
golden files.

 Optimize select count(1) with RCFile and Orc
 

 Key: HIVE-4113
 URL: https://issues.apache.org/jira/browse/HIVE-4113
 Project: Hive
  Issue Type: Bug
  Components: File Formats
Reporter: Gopal V
Assignee: Yin Huai
 Fix For: 0.12.0

 Attachments: HIVE-4113-0.patch, HIVE-4113.1.patch, HIVE-4113.2.patch, 
 HIVE-4113.3.patch, HIVE-4113.4.patch, HIVE-4113.5.patch, HIVE-4113.6.patch, 
 HIVE-4113.7.patch, HIVE-4113.8.patch, HIVE-4113.patch, HIVE-4113.patch




[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-09-19 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13771652#comment-13771652
 ] 

Ashutosh Chauhan commented on HIVE-4113:


Thanks, [~yhuai], for taking this one up.
It's a known existing problem that predicate pushdown doesn't happen for 
HCatalog today. I would say that if it is getting burdensome, we can tackle 
that in a separate jira. 
I am fine with removing the flag for column pruning. It's been around for a 
long time (HIVE-279) and I haven't come across a case where a user has run into 
a problem with it.
I didn't get your comment about READ_ALL_COLUMNS_DEFAULT. If we set it to true, 
will that imply that this optimization will be off by default? That seems like 
a bad choice. In HCatInputFormat, we can probably set the config such that it 
always selects all columns for now. That way Hive will still get the benefit of 
the optimization and HCatalog will continue with what it is doing today. 

 Optimize select count(1) with RCFile and Orc
 

 Key: HIVE-4113
 URL: https://issues.apache.org/jira/browse/HIVE-4113
 Project: Hive
  Issue Type: Bug
  Components: File Formats
Reporter: Gopal V
Assignee: Yin Huai
 Fix For: 0.12.0

 Attachments: HIVE-4113-0.patch, HIVE-4113.1.patch, HIVE-4113.2.patch, 
 HIVE-4113.patch, HIVE-4113.patch




[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-09-19 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13771847#comment-13771847
 ] 

Yin Huai commented on HIVE-4113:


READ_ALL_COLUMNS and READ_ALL_COLUMNS_DEFAULT are mainly created for HCat, 
because I think it is a burden on users if they have to be aware of 
ColumnProjectionUtils and use it every time. So, through HCat, if users do not 
use ColumnProjectionUtils to set the needed columns, we will read all columns. 
If we set READ_ALL_COLUMNS_DEFAULT=false, no column will be read if a user does 
not use ColumnProjectionUtils.

In Hive, if we get rid of the flag for column pruning, the list of 
neededColumnIDs in TS will never be null. Thus, in Hive, we will always set 
READ_ALL_COLUMNS to false (the .2 patch has an issue with this... I will fix it 
later).

In summary, in Hive, we use neededColumnIDs in TS as the only way to tell the 
underlying record reader what to read. If neededColumnIDs is an empty list, we 
know that no column is needed. Otherwise, we read the columns specified in 
neededColumnIDs (if we have select * in a sub-query, neededColumnIDs should be 
populated to include all columns).

In HCat, if a user wants to use the MapReduce interface, he or she has two ways 
to tell us what columns are needed. 1) The user does nothing. In this case, we 
will read all columns. 2) The user uses utility functions in 
ColumnProjectionUtils (e.g. setReadColumnIDs) to specify the needed columns. In 
this case, READ_ALL_COLUMNS will be set to false and we only read the columns 
specified in READ_COLUMN_IDS_CONF_STR.

I hope what I am proposing makes sense. I welcome any suggestions :)
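
The two caller behaviours described in this comment (do nothing and get every column, or name the needed columns explicitly) can be modelled in a few lines. The following is only a toy sketch, not the real ColumnProjectionUtils: the class name, the configuration keys and the use of a plain Map in place of a Hadoop Configuration are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the proposed projection settings; NOT the real Hive class. */
final class ProjectionConfSketch {
  // Hypothetical keys, chosen only for this example.
  static final String READ_ALL_KEY = "sketch.read.all.columns";
  static final String READ_IDS_KEY = "sketch.read.column.ids";
  // Value assumed when the caller never touched the projection settings.
  static final boolean READ_ALL_DEFAULT = true;

  /** Caller names the needed columns; an empty list means "no column". */
  static void setReadColumnIDs(Map<String, String> conf, List<Integer> ids) {
    StringBuilder sb = new StringBuilder();
    for (int id : ids) {
      if (sb.length() > 0) {
        sb.append(',');
      }
      sb.append(id);
    }
    conf.put(READ_IDS_KEY, sb.toString());
    // Naming columns explicitly switches "read everything" off.
    conf.put(READ_ALL_KEY, "false");
  }

  /** Falls back to the default when the flag was never set. */
  static boolean isReadAllColumns(Map<String, String> conf) {
    String value = conf.get(READ_ALL_KEY);
    return value == null ? READ_ALL_DEFAULT : Boolean.parseBoolean(value);
  }

  /** Returns the explicitly requested column IDs (possibly empty). */
  static List<Integer> getReadColumnIDs(Map<String, String> conf) {
    List<Integer> ids = new ArrayList<>();
    String value = conf.get(READ_IDS_KEY);
    if (value == null || value.isEmpty()) {
      return ids;
    }
    for (String part : value.split(",")) {
      ids.add(Integer.parseInt(part.trim()));
    }
    return ids;
  }

  public static void main(String[] args) {
    Map<String, String> untouched = new HashMap<>();
    System.out.println(isReadAllColumns(untouched)); // caller did nothing

    Map<String, String> projected = new HashMap<>();
    setReadColumnIDs(projected, Arrays.asList(0, 3));
    System.out.println(isReadAllColumns(projected));
    System.out.println(getReadColumnIDs(projected));
  }
}
```

With a default of true, a caller who never sets anything still gets every column, while a query such as select count(1) can pass an empty list and have nothing read.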

 Optimize select count(1) with RCFile and Orc
 

 Key: HIVE-4113
 URL: https://issues.apache.org/jira/browse/HIVE-4113
 Project: Hive
  Issue Type: Bug
  Components: File Formats
Reporter: Gopal V
Assignee: Yin Huai
 Fix For: 0.12.0

 Attachments: HIVE-4113-0.patch, HIVE-4113.1.patch, HIVE-4113.2.patch, 
 HIVE-4113.patch, HIVE-4113.patch




[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-09-19 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13771905#comment-13771905
 ] 

Ashutosh Chauhan commented on HIVE-4113:


Sounds good to me. Go ahead and make the changes.

 Optimize select count(1) with RCFile and Orc
 

 Key: HIVE-4113
 URL: https://issues.apache.org/jira/browse/HIVE-4113
 Project: Hive
  Issue Type: Bug
  Components: File Formats
Reporter: Gopal V
Assignee: Yin Huai
 Fix For: 0.12.0

 Attachments: HIVE-4113-0.patch, HIVE-4113.1.patch, HIVE-4113.2.patch, 
 HIVE-4113.patch, HIVE-4113.patch




[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-09-19 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13772163#comment-13772163
 ] 

Ashutosh Chauhan commented on HIVE-4113:


[~yhuai] I left some comments on RB. But it seems like you updated the patch in 
the meantime, so you may have already addressed some of those.

 Optimize select count(1) with RCFile and Orc
 

 Key: HIVE-4113
 URL: https://issues.apache.org/jira/browse/HIVE-4113
 Project: Hive
  Issue Type: Bug
  Components: File Formats
Reporter: Gopal V
Assignee: Yin Huai
 Fix For: 0.12.0

 Attachments: HIVE-4113-0.patch, HIVE-4113.1.patch, HIVE-4113.2.patch, 
 HIVE-4113.3.patch, HIVE-4113.patch, HIVE-4113.patch




[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-09-19 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13772317#comment-13772317
 ] 

Yin Huai commented on HIVE-4113:


please ignore those duplicated replies... 

 Optimize select count(1) with RCFile and Orc
 

 Key: HIVE-4113
 URL: https://issues.apache.org/jira/browse/HIVE-4113
 Project: Hive
  Issue Type: Bug
  Components: File Formats
Reporter: Gopal V
Assignee: Yin Huai
 Fix For: 0.12.0

 Attachments: HIVE-4113-0.patch, HIVE-4113.1.patch, HIVE-4113.2.patch, 
 HIVE-4113.3.patch, HIVE-4113.patch, HIVE-4113.patch




[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-09-19 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13772590#comment-13772590
 ] 

Yin Huai commented on HIVE-4113:


[~ashutoshc] Using LinkedHashSet as the type of neededColumns requires changes 
in lots of places. Since we always do the deduplication work in 
ColumnProjectionUtils.getReadColumnIDs(Configuration), is it necessary to make 
this replacement?
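
To illustrate the argument: if the reading side always removes duplicates, the producers can keep passing a plain list. A minimal sketch of read-side deduplication follows, assuming the IDs travel as a comma-separated string. The class and method names are hypothetical and this is not the actual body of getReadColumnIDs.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper showing order-preserving deduplication on read. */
final class ReadColumnIdsDedupSketch {
  /** Parses a comma-separated ID string, dropping repeats, keeping first-seen order. */
  static List<Integer> parseDistinctIds(String csv) {
    Set<Integer> seen = new LinkedHashSet<>();
    if (csv != null) {
      for (String part : csv.split(",")) {
        String trimmed = part.trim();
        if (!trimmed.isEmpty()) {
          seen.add(Integer.parseInt(trimmed)); // add() ignores duplicates
        }
      }
    }
    return new ArrayList<>(seen);
  }

  public static void main(String[] args) {
    System.out.println(parseDistinctIds("0,2,2,5,0")); // prints [0, 2, 5]
  }
}
```

The LinkedHashSet is confined to the one method that needs it, so callers that build the ID list are unaffected.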

 Optimize select count(1) with RCFile and Orc
 

 Key: HIVE-4113
 URL: https://issues.apache.org/jira/browse/HIVE-4113
 Project: Hive
  Issue Type: Bug
  Components: File Formats
Reporter: Gopal V
Assignee: Yin Huai
 Fix For: 0.12.0

 Attachments: HIVE-4113-0.patch, HIVE-4113.1.patch, HIVE-4113.2.patch, 
 HIVE-4113.3.patch, HIVE-4113.4.patch, HIVE-4113.patch, HIVE-4113.patch




[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-09-18 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13771243#comment-13771243
 ] 

Yin Huai commented on HIVE-4113:


I thought there was no flag for column pruning, so 
tableScan.getNeededColumnIDs() would never be null... But there is a flag 
(hive.optimize.cp)... So, when hive.optimize.cp=false, neededColumnIDs in 
TableScanOperator will not be set... I am so sorry that I have blocked this 
jira for a long time... I think Brock's patch is good. I will just rebase it 
and also make a minor change to the comments in ColumnProjectionUtils.



 Optimize select count(1) with RCFile and Orc
 

 Key: HIVE-4113
 URL: https://issues.apache.org/jira/browse/HIVE-4113
 Project: Hive
  Issue Type: Bug
  Components: File Formats
Reporter: Gopal V
Assignee: Yin Huai
 Fix For: 0.12.0

 Attachments: HIVE-4113-0.patch, HIVE-4113.patch, HIVE-4113.patch




[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-09-18 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13771307#comment-13771307
 ] 

Yin Huai commented on HIVE-4113:


[~brocknoland] I have one question. Why do we need 
ColumnProjectionUtils.setReadAllColumns(jobConf); in those hcat classes (e.g. 
InitializeInput)?

 Optimize select count(1) with RCFile and Orc
 

 Key: HIVE-4113
 URL: https://issues.apache.org/jira/browse/HIVE-4113
 Project: Hive
  Issue Type: Bug
  Components: File Formats
Reporter: Gopal V
Assignee: Yin Huai
 Fix For: 0.12.0

 Attachments: HIVE-4113-0.patch, HIVE-4113.patch, HIVE-4113.patch




[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-09-18 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13771309#comment-13771309
 ] 

Brock Noland commented on HIVE-4113:


Remove it and see what happens? I don't remember exactly, but I thought I put 
that in there because if you don't specify anything now, we won't read any 
columns, while they were expecting all columns to be read.

 Optimize select count(1) with RCFile and Orc
 

 Key: HIVE-4113
 URL: https://issues.apache.org/jira/browse/HIVE-4113
 Project: Hive
  Issue Type: Bug
  Components: File Formats
Reporter: Gopal V
Assignee: Yin Huai
 Fix For: 0.12.0

 Attachments: HIVE-4113-0.patch, HIVE-4113.patch, HIVE-4113.patch




[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-09-18 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13771370#comment-13771370
 ] 

Yin Huai commented on HIVE-4113:


[~brocknoland] I see. Thanks. I am not sure if those changes will affect 
reading RCFile and ORC through HCat (i.e., whether we will read unnecessary 
columns). Let me check.

 Optimize select count(1) with RCFile and Orc
 

 Key: HIVE-4113
 URL: https://issues.apache.org/jira/browse/HIVE-4113
 Project: Hive
  Issue Type: Bug
  Components: File Formats
Reporter: Gopal V
Assignee: Yin Huai
 Fix For: 0.12.0

 Attachments: HIVE-4113-0.patch, HIVE-4113.1.patch, HIVE-4113.patch, 
 HIVE-4113.patch




[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-09-18 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13771530#comment-13771530
 ] 

Yin Huai commented on HIVE-4113:


Three issues:
# ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR is only used in Hive; 
HCatalog does not appear to set it. So it seems that we cannot do predicate 
pushdown when accessing ORC through HCatalog.
# neededColumnIDs in TableScanOperator can be null when column pruning is 
disabled. In that case we may hit an NPE in 
ColumnAccessAnalyzer.analyzeColumnAccess, and we also cannot do predicate 
pushdown in Hive.
# With this change, an empty neededColumnIDs means that no column is needed. 
Selecting all columns is represented either by 
ColumnProjectionUtils.READ_ALL_COLUMNS=true or by READ_COLUMN_IDS_CONF_STR 
listing every column.

I will make two changes.
# Remove the column pruning flag.
# Set READ_ALL_COLUMNS_DEFAULT to true, so that if HCatalog users do not use 
ColumnProjectionUtils, we select all columns for them. With false as 
READ_ALL_COLUMNS_DEFAULT, users would have to use ColumnProjectionUtils; 
otherwise no column would be selected.
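
For illustration, the convention described above can be sketched roughly as follows. This is not the patch: the class ProjectionSketch, the method skippedColumns, and the plain Map standing in for the job configuration are all made up for the example, and the two key strings are assumed to be the values behind READ_ALL_COLUMNS and READ_COLUMN_IDS_CONF_STR.

```java
import java.util.Arrays;
import java.util.Map;

// Hypothetical stand-in for the projection logic discussed above; not Hive code.
class ProjectionSketch {
  // Key named in this thread; assumed to back READ_ALL_COLUMNS.
  static final String READ_ALL_COLUMNS = "hive.io.file.read.all.columns";
  // Assumed value of READ_COLUMN_IDS_CONF_STR.
  static final String READ_COLUMN_IDS = "hive.io.file.readcolumn.ids";
  // Default of true, so callers that never set anything still get every column.
  static final boolean READ_ALL_COLUMNS_DEFAULT = true;

  /** Returns a mask in which true means "do not read this column". */
  static boolean[] skippedColumns(Map<String, String> conf, int numColumns) {
    boolean[] skipped = new boolean[numColumns];
    String readAll = conf.get(READ_ALL_COLUMNS);
    boolean all = (readAll == null)
        ? READ_ALL_COLUMNS_DEFAULT
        : Boolean.parseBoolean(readAll);
    if (all) {
      return skipped; // all false: read every column, as for select *
    }
    // An empty or missing ID list means no column is needed, as for count(1).
    Arrays.fill(skipped, true);
    String ids = conf.get(READ_COLUMN_IDS);
    if (ids != null && !ids.trim().isEmpty()) {
      for (String id : ids.split(",")) {
        int idx = Integer.parseInt(id.trim());
        if (idx >= 0 && idx < numColumns) {
          skipped[idx] = false; // this column was explicitly requested
        }
      }
    }
    return skipped;
  }
}
```

With an empty configuration the mask is all false, which is the behaviour a caller that never touches ColumnProjectionUtils would rely on.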

 Optimize select count(1) with RCFile and Orc
 

 Key: HIVE-4113
 URL: https://issues.apache.org/jira/browse/HIVE-4113
 Project: Hive
  Issue Type: Bug
  Components: File Formats
Reporter: Gopal V
Assignee: Yin Huai
 Fix For: 0.12.0

 Attachments: HIVE-4113-0.patch, HIVE-4113.1.patch, HIVE-4113.patch, 
 HIVE-4113.patch




[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-09-17 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13769688#comment-13769688
 ] 

Ashutosh Chauhan commented on HIVE-4113:


In addition to what [~yhuai] suggested for RCFile, a similar enhancement exists 
for ORC as well, since ORC stores stats (including counts) per stripe, which 
would allow us to do almost no IO. However, those enhancements will likely 
require changes in query processing code, so I consider them out of scope 
for this jira. Let's get this one in and take up enhancements in a follow-up. 
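
A rough sketch of the idea, assuming each stripe exposes its row count as metadata. StripeCountSketch and StripeMeta are hypothetical names for this example and do not refer to the actual ORC reader classes.

```java
import java.util.List;

// Hypothetical model of per-stripe metadata; the real ORC reader is not used here.
class StripeCountSketch {
  static final class StripeMeta {
    final long numberOfRows;

    StripeMeta(long numberOfRows) {
      this.numberOfRows = numberOfRows;
    }
  }

  /** Answers an unfiltered count from stripe metadata alone, without reading column data. */
  static long countRows(List<StripeMeta> stripes) {
    long total = 0;
    for (StripeMeta stripe : stripes) {
      total += stripe.numberOfRows;
    }
    return total;
  }
}
```

This shortcut only applies when the query has no filter; with a WHERE clause the rows still have to be evaluated.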

 Optimize select count(1) with RCFile and Orc
 

 Key: HIVE-4113
 URL: https://issues.apache.org/jira/browse/HIVE-4113
 Project: Hive
  Issue Type: Bug
  Components: File Formats
Reporter: Gopal V
Assignee: Brock Noland
 Fix For: 0.12.0

 Attachments: HIVE-4113-0.patch, HIVE-4113.patch, HIVE-4113.patch




[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-09-17 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13769693#comment-13769693
 ] 

Brock Noland commented on HIVE-4113:


Agreed. Unfortunately I won't have time to take this up in the next few days so 
if someone has time and would like to see this in soon I'd be more than willing 
to hand it off.



[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-09-17 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13769730#comment-13769730
 ] 

Yin Huai commented on HIVE-4113:


Let me take a look. It seems only a few minor changes are needed for Brock's 
patch. One thing I need to verify is whether we populate all columns in the 
list of needed columns for 'select * from'. If so, we will not need 
hive.io.file.read.all.columns.



[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-09-17 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13769740#comment-13769740
 ] 

Ashutosh Chauhan commented on HIVE-4113:


Thanks [~yhuai] for volunteering. Assigning it to you.



[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-09-17 Thread Prasanth J (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13769712#comment-13769712
 ] 

Prasanth J commented on HIVE-4113:
--

HIVE-4340 will expose ORC stats through reader interfaces which can be used for 
optimizing count(*).



[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-09-17 Thread Prasanth J (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13769716#comment-13769716
 ] 

Prasanth J commented on HIVE-4113:
--

Sorry. Please ignore that comment. Row count interface already exists in ORC 
reader. HIVE-4340 is not relevant for this JIRA.



[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-09-10 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13763425#comment-13763425
 ] 

Ashutosh Chauhan commented on HIVE-4113:


[~brocknoland] Are you still working on this? Looks like a useful 
optimization. If you can address [~yhuai]'s comments and rebase the patch, I 
will be happy to help review the patch.



[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-09-10 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13763506#comment-13763506
 ] 

Brock Noland commented on HIVE-4113:


Thanks [~yhuai] for reviewing and thanks Ashutosh for pinging me on this. I'll 
try and look at how out of date this patch is within the next week.



[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-07-16 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13709510#comment-13709510
 ] 

Yin Huai commented on HIVE-4113:


[~brocknoland] It seems that we use setNeededColumnIDs in TableScanOperator to 
set needed columns in ColumnPrunerTableScanProc (in the class 
ColumnPrunerProcFactory), so neededColumnIDs in TableScanOperator will never be 
null. If I am right, for the code in HiveInputFormat shown below ...
{code:java}
// push down projections
ArrayList<Integer> list = tableScan.getNeededColumnIDs();
if (list != null) {
  ColumnProjectionUtils.appendReadColumnIDs(jobConf, list);
} else {
  ColumnProjectionUtils.setReadAllColumns(jobConf);
}
{code}
setReadAllColumns will never be called.

Also, assuming we use RCFile, if we have 'select count(1)', we will skip all 
columns. It seems that we can still generate correct results because, from the 
key buffer, we know recordsNumInValBuffer (the number of rows in a row group) 
and we call 'next' recordsNumInValBuffer times. Is my understanding 
correct? If so, do you think we should add some comments explaining it where we 
set all elements of skippedColIDs to true? I think that we can take advantage 
of recordsNumInValBuffer to make an improvement. Instead of calling 'next' 
recordsNumInValBuffer times, we can pass this number directly to 
GroupByOperator (I have not considered whether it is easy to implement). That 
would save a lot of unnecessary function calls. If we want to do this 
improvement, we can work on it in a separate jira. 
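
To make the suggestion concrete, here is a minimal sketch that contrasts the two ways of counting. It only models the idea: RowGroupCountSketch and both method names are invented for this example, and the array of per-row-group counts stands in for the recordsNumInValBuffer values read from the key buffers.

```java
// Hypothetical model of counting rows per row group; not the RCFile reader itself.
class RowGroupCountSketch {
  /** Behaviour as described above: one step per row, even though no column is read. */
  static long countByIterating(int[] recordsPerRowGroup) {
    long count = 0;
    for (int recordsNumInValBuffer : recordsPerRowGroup) {
      for (int i = 0; i < recordsNumInValBuffer; i++) {
        count++; // stands in for one next() call plus one aggregation step
      }
    }
    return count;
  }

  /** Suggested shortcut: add each row group's record count in a single step. */
  static long countByRowGroup(int[] recordsPerRowGroup) {
    long count = 0;
    for (int recordsNumInValBuffer : recordsPerRowGroup) {
      count += recordsNumInValBuffer;
    }
    return count;
  }
}
```

Both methods return the same total; the second does work proportional to the number of row groups rather than the number of rows.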



[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-07-16 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13709686#comment-13709686
 ] 

Hive QA commented on HIVE-4113:
---



{color:green}Overall:{color}: +1 all checks pass

{color:green}SUCCESS:{color} +1 all tests passed

Executing org.apache.hive.ptest.execution.CleanupPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase



[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

2013-06-15 Thread Brock Noland (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13684535#comment-13684535
 ] 

Brock Noland commented on HIVE-4113:


[~owen.omalley] would you have some time to look at attached patch? Thanks!

 Optimize select count(1) with RCFile and Orc
 

 Key: HIVE-4113
 URL: https://issues.apache.org/jira/browse/HIVE-4113
 Project: Hive
  Issue Type: Bug
  Components: File Formats
Reporter: Gopal V
Assignee: Brock Noland
 Fix For: 0.12.0

 Attachments: HIVE-4113-0.patch

