Gopal V created HIVE-4926:
-----------------------------

             Summary: Queries which specify clustered-by keys as constants will 
still scan all buckets
                 Key: HIVE-4926
                 URL: https://issues.apache.org/jira/browse/HIVE-4926
             Project: Hive
          Issue Type: Improvement
    Affects Versions: 0.12.0
            Reporter: Gopal V


When tables are CLUSTERED BY (key) into multiple buckets, a query which 
specifies a key in the query predicate will still scan all buckets in the 
directory.

In the ideal scenario, only one bucket needs to be inspected for a given key, 
particularly if hive.enforce.bucketing is turned on.
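
To illustrate why a constant key maps to exactly one bucket file, here is a minimal sketch of the lookup. It assumes the clustering column is a 32-bit int (whose Hive hash is the value itself), that the bucket is chosen as (hash & Integer.MAX_VALUE) % numBuckets, and that the table has 32 buckets (inferred from the 32x figure in this report, not stated explicitly). The function names are hypothetical and only for illustration; the file name pattern follows the log output in this report.

```python
# Sketch only: mirrors the assumed default bucketing rule for an int key,
# (hash & Integer.MAX_VALUE) % numBuckets, where an int hashes to itself.

INT_MAX = 0x7FFFFFFF  # Java's Integer.MAX_VALUE


def bucket_for_int_key(key: int, num_buckets: int) -> int:
    """Return the bucket index for a 32-bit integer key."""
    return (key & INT_MAX) % num_buckets


def bucket_file_name(bucket: int) -> str:
    """Return the bucket file name in the six-digit form seen in the logs."""
    return f"{bucket:06d}_0"


# Assumed bucket count; adjust to the table's actual CLUSTERED BY ... INTO n BUCKETS.
NUM_BUCKETS = 32

bucket = bucket_for_int_key(1, NUM_BUCKETS)
print(bucket_file_name(bucket))
```

Under these assumptions, the predicate ss_item_sk = 1 would only require reading a single file out of the 32 in the table directory.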

When a simple filter query like the following is run

{code}
select * from store_sales where ss_item_sk = 1;
{code}

the log files contain:

{code}
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://hadoop1.lxc:56565/user/hive/warehouse/hive_bucketed.db/store_sales/000005_0
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://hadoop1.lxc:56565/user/hive/warehouse/hive_bucketed.db/store_sales/000006_0
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://hadoop1.lxc:56565/user/hive/warehouse/hive_bucketed.db/store_sales/000007_0
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://hadoop1.lxc:56565/user/hive/warehouse/hive_bucketed.db/store_sales/000008_0
org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://hadoop1.lxc:56565/user/hive/warehouse/hive_bucketed.db/store_sales/000009_0
{code}

This goes through 32x the amount of data compared to the right approach of 
scanning only the bucket which matches the predicate.




