[ https://issues.apache.org/jira/browse/HIVE-4926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gopal V updated HIVE-4926:
--------------------------

    Component/s: Query Processor
         Labels: perfomance  (was: )
    
> Queries which specify clustered-by keys as constants will still scan all 
> buckets
> --------------------------------------------------------------------------------
>
>                 Key: HIVE-4926
>                 URL: https://issues.apache.org/jira/browse/HIVE-4926
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>    Affects Versions: 0.12.0
>            Reporter: Gopal V
>              Labels: perfomance
>         Attachments: HIVE-4926-test.tgz
>
>
> When tables are CLUSTERED BY (key) into multiple buckets, a query which 
> specifies a key in the query predicate will still scan all buckets in the 
> directory.
> In the ideal scenario, only one bucket needs to be inspected for a given key, 
> particularly if hive.enforce.bucketing is turned on.
> When a simple filter query like the following is run:
> {code}
> select * from store_sales where ss_item_sk = 1;
> {code}
> The log files contain
> {code}
> org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://hadoop1.lxc:56565/user/hive/warehouse/hive_bucketed.db/store_sales/000005_0
> org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://hadoop1.lxc:56565/user/hive/warehouse/hive_bucketed.db/store_sales/000006_0
> org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://hadoop1.lxc:56565/user/hive/warehouse/hive_bucketed.db/store_sales/000007_0
> org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://hadoop1.lxc:56565/user/hive/warehouse/hive_bucketed.db/store_sales/000008_0
> org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader: Processing file hdfs://hadoop1.lxc:56565/user/hive/warehouse/hive_bucketed.db/store_sales/000009_0
> {code}
> This reads 32x as much data as the right approach of scanning only the 
> bucket which matches the predicate.
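
As a rough illustration of the pruning described above, the sketch below works out which single bucket file a constant key would map to. It rests on several assumptions that are not stated in the issue: that the table has 32 buckets (inferred from the "32x" figure), that ss_item_sk is an integer column whose hash is its own value, that the bucket index is (hash & Integer.MAX_VALUE) % numBuckets, and that bucket files are named with a six-digit index followed by "_0", as in the log excerpt. The function names are hypothetical and are not part of Hive.

```python
def bucket_for_int_key(key: int, num_buckets: int) -> int:
    """Return the bucket index for an integer key.

    Assumption: the hash of an integer key is the key itself, and the
    bucket is (hash & Integer.MAX_VALUE) % num_buckets.
    """
    hash_code = key & 0xFFFFFFFF                 # emulate a 32-bit Java int
    return (hash_code & 0x7FFFFFFF) % num_buckets  # clear the sign bit, then mod


def bucket_file_name(bucket: int) -> str:
    """Return the assumed file name for a bucket, e.g. 5 -> '000005_0'."""
    return f"{bucket:06d}_0"


if __name__ == "__main__":
    # Assumed bucket count; the issue only implies it via "32x".
    num_buckets = 32
    bucket = bucket_for_int_key(1, num_buckets)
    print(bucket_file_name(bucket))
```

Under these assumptions, the predicate ss_item_sk = 1 would only require reading one file instead of every file in the table directory.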

