[ https://issues.apache.org/jira/browse/HIVE-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nikhil Deshpande updated HIVE-1694: ----------------------------------- Status: Patch Available (was: Open) This is a patch to demonstrate query performance gains using indexes (added in HIVE-417). The patch is over latest hive trunk. ChangeLog for the patch: - Implements a new rewrite for a certain set of queries with GROUP BY to speed those queries by running them on index data instead of base table. - Implements a skeleton generic rewrite engine. - Implements the rewrite rule for a GroupBy queries set (mentioned above). More details in the class comment GbToCompactSumIdxRewrite. - Rewrite needs to be currently explicitly enabled with a flag hive.ql.rw.gb_to_idx. - Modifies metastore & metadata API for getting some index info. - Modifies QB metadata & parseblock code to add some rewrite assist methods. - Inserts a rewrite hook into Semantic Analyzer. - Fixes a bug in ql QTestUtil to clean-up indexed tables properly - Contains new test for Group By rewrite using indexes: ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q Quick performance test results on a very small Hadoop cluster: 2 queries (chosen to demonstrate perf gains) run on TPC-H benchmark data lineitem table. Timings in seconds, data set size (1M, 1G etc.) is TPC-H scale factor. {noformat} ----------------------------------------------- 1M 1G 10G 30G ----------------------------------------------- q1_no_idx 24.161 76.790 506.005 1551.555 q1_with_idx 21.268 27.292 35.502 86.133 ----------------------------------------------- q1_no_idx 73.660 130.587 764.619 2146.423 q2_with_idx 69.393 75.493 92.867 190.619 ----------------------------------------------- {noformat} Hadoop cluster description used for above perf test: - 2 server class machines (each box: CentOS 5.x Linux, 5 SAS disks in RAID5, 16GB RAM) - 2-node Hadoop cluster (0.20.2), un-tuned and un-optimized, data not partitioned and clustered, Hive tables stored in row-store format, HDFS replication factor: 2 - Sun JDK 1.6 (server mode JVM, JVM_HEAP_SIZE:4GB RAM) - Queries on TPC-H Data (lineitem table: 70% of TPC-H data size, e.g. TPC-H 30GB data: 21GB lineitem, ~180Million tuples) These changes are being maintained at http://github.com/prafullat/hive > Accelerate query execution using indexes > ---------------------------------------- > > Key: HIVE-1694 > URL: https://issues.apache.org/jira/browse/HIVE-1694 > Project: Hive > Issue Type: New Feature > Components: Indexing, Query Processor > Affects Versions: 0.7.0 > Reporter: Nikhil Deshpande > > The index building patch (Hive-417) is checked into trunk, this JIRA issue > tracks supporting indexes in Hive compiler & execution engine for SELECT > queries. > This is in ref. to John's comment at > https://issues.apache.org/jira/browse/HIVE-417?focusedCommentId=12884869&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12884869 > on creating separate JIRA issue for tracking index usage in optimizer & query > execution. > The aim of this effort is to use indexes to accelerate query execution (for > certain class of queries). E.g. > - Filters and range scans (already being worked on by He Yongqiang as part of > HIVE-417?) > - Joins (index based joins) > - Group By, Order By and other misc cases > The proposal is multi-step: > 1. Building index based operators, compiler and execution engine changes > 2. Optimizer enhancements (e.g. cost-based optimizer to compare and choose > between index scans, full table scans etc.) > This JIRA initially focuses on the first step. This JIRA is expected to hold > the information about index based plans & operator implementations for above > mentioned cases. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.