Gopal V created HIVE-4926:
-----------------------------

Summary: Queries which specify clustered-by keys as constants will still scan all buckets
Key: HIVE-4926
URL: https://issues.apache.org/jira/browse/HIVE-4926
Project: Hive
Issue Type: Improvement
Affects Versions: 0.12.0
Reporter: Gopal V

When tables are CLUSTERED BY (key) into multiple buckets, a query which specifies a constant for that key in its predicate will still scan all buckets in the directory.

In the ideal scenario, only one bucket needs to be inspected for a given key, particularly if hive.enforce.bucketing is turned on.

When a simple filter query like the following is run

{code}
select * from store_sales where ss_item_sk = 1;
{code}

the log files contain

{code}
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://hadoop1.lxc:56565/user/hive/warehouse/hive_bucketed.db/store_sales/000005_0
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://hadoop1.lxc:56565/user/hive/warehouse/hive_bucketed.db/store_sales/000006_0
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://hadoop1.lxc:56565/user/hive/warehouse/hive_bucketed.db/store_sales/000007_0
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://hadoop1.lxc:56565/user/hive/warehouse/hive_bucketed.db/store_sales/000008_0
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://hadoop1.lxc:56565/user/hive/warehouse/hive_bucketed.db/store_sales/000009_0
{code}

This is going through 32x the amount of data, compared to the right approach of scanning only the bucket which matches the predicate.
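
To illustrate the pruning this issue asks for, here is a minimal sketch of how a constant key could be mapped to the single bucket file that needs to be read. It is not Hive's planner code. The class and method names are hypothetical, the table is assumed to have 32 buckets (inferred from the 32x figure above), the hash of an int key is assumed to be the value itself, and the file naming is assumed to follow the zero-padded pattern seen in the log with a `_0` suffix.

```java
import java.util.Locale;

// Hypothetical illustration only; not part of Hive.
final class BucketPruningSketch {

    // Assumption: an int key hashes to its own value, and the bucket is the
    // non-negative hash modulo the number of buckets.
    static int bucketFor(int key, int numBuckets) {
        if (numBuckets <= 0) {
            throw new IllegalArgumentException("numBuckets must be positive");
        }
        int hash = key;
        // Mask the sign bit so negative keys still map to a valid bucket.
        return (hash & Integer.MAX_VALUE) % numBuckets;
    }

    // Assumption: bucket files are named as in the log above, e.g. 000005_0.
    static String bucketFileName(int bucket) {
        return String.format(Locale.ROOT, "%06d_0", bucket);
    }

    public static void main(String[] args) {
        // ss_item_sk = 1 on a table assumed to have 32 buckets
        int bucket = bucketFor(1, 32);
        System.out.println(bucketFileName(bucket)); // prints 000001_0
    }
}
```

Under these assumptions, the query with ss_item_sk = 1 would only need to read 000001_0 rather than every file in the table directory. Such pruning is only safe when the data was actually written with bucketing enforced.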