[ 
https://issues.apache.org/jira/browse/HIVE-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikhil Deshpande updated HIVE-1694:
-----------------------------------

    Status: Patch Available  (was: Open)

This is a patch to demonstrate query performance gains using indexes
(added in HIVE-417). The patch is against the latest Hive trunk.

ChangeLog for the patch:
- Implements a new rewrite for a certain set of GROUP BY queries that speeds 
them up by running them on index data instead of the base table.
- Implements a skeleton generic rewrite engine.
- Implements the rewrite rule for the set of GROUP BY queries mentioned above.  
More details are in the class comment of GbToCompactSumIdxRewrite.
- The rewrite currently needs to be enabled explicitly with the flag 
hive.ql.rw.gb_to_idx.
- Modifies metastore & metadata API for getting some index info.
- Modifies QB metadata & parseblock code to add some rewrite assist methods.
- Inserts a rewrite hook into Semantic Analyzer.
- Fixes a bug in ql QTestUtil to clean up indexed tables properly.
- Adds a new test for the Group By rewrite using indexes: 
ql/src/test/queries/clientpositive/ql_rewrite_gbtoidx.q
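
To illustrate the idea behind the Group By rewrite outside of Hive, here is a minimal Python sketch. It assumes a compact index that keeps, for each distinct key, the list of offsets of the base-table rows with that key, and it simplifies to a single data file. The table contents and all function names are hypothetical and are not taken from the patch; the actual rule is described in the class comment of GbToCompactSumIdxRewrite.

```python
from collections import defaultdict

# Hypothetical base table rows: (byte offset in the data file, l_shipdate).
base_rows = [
    (0, "1998-01-01"),
    (120, "1998-01-02"),
    (240, "1998-01-01"),
    (360, "1998-01-03"),
    (480, "1998-01-01"),
]


def count_by_scan(rows):
    """GROUP BY key with COUNT, computed by scanning every base row."""
    counts = defaultdict(int)
    for _offset, key in rows:
        counts[key] += 1
    return dict(counts)


def build_compact_index(rows):
    """Assumed index layout: one entry per distinct key, holding row offsets."""
    index = defaultdict(list)
    for offset, key in rows:
        index[key].append(offset)
    return dict(index)


def count_by_index(index):
    """The same aggregate, answered from the index entries alone."""
    return {key: len(offsets) for key, offsets in index.items()}


index = build_compact_index(base_rows)

# Both plans must agree; the index plan reads one entry per distinct key
# instead of one row per tuple.
assert count_by_index(index) == count_by_scan(base_rows)
print(count_by_index(index))
```

The index plan touches one entry per distinct key rather than every tuple, which is why the gain should grow with the size of the base table.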

Quick performance test results on a very small Hadoop cluster:

Two queries (chosen to demonstrate perf gains) were run on the lineitem table 
of the TPC-H benchmark data.

Timings are in seconds; the data set size (1M, 1G etc.) is the TPC-H scale factor.
{noformat}
-----------------------------------------------
                 1M       1G      10G       30G
-----------------------------------------------
  q1_no_idx  24.161   76.790  506.005  1551.555
q1_with_idx  21.268   27.292   35.502    86.133
-----------------------------------------------
  q2_no_idx  73.660  130.587  764.619  2146.423
q2_with_idx  69.393   75.493   92.867   190.619
-----------------------------------------------
{noformat}
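
As a rough reading of the table, the speedup at each data size is the no-index time divided by the with-index time. The Python sketch below only redoes that arithmetic on the numbers above; it assumes the second no-index row is the baseline for q2, and the variable and function names are illustrative.

```python
# Timings in seconds, copied from the table above.
sizes = ["1M", "1G", "10G", "30G"]
timings = {
    "q1": {
        "no_idx": [24.161, 76.790, 506.005, 1551.555],
        "with_idx": [21.268, 27.292, 35.502, 86.133],
    },
    # Assumption: the second no-index row in the table belongs to q2.
    "q2": {
        "no_idx": [73.660, 130.587, 764.619, 2146.423],
        "with_idx": [69.393, 75.493, 92.867, 190.619],
    },
}


def speedups(query):
    """Return no_idx / with_idx for each data size, rounded to one decimal."""
    t = timings[query]
    return [round(a / b, 1) for a, b in zip(t["no_idx"], t["with_idx"])]


for query in ("q1", "q2"):
    for size, ratio in zip(sizes, speedups(query)):
        print(f"{query} {size:>3}: {ratio}x")
```

On these numbers the gain is negligible at 1M and grows with data size, to roughly 18x for q1 and roughly 11x for q2 at 30G.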

Description of the Hadoop cluster used for the above perf test:
- 2 server-class machines (each box: CentOS 5.x Linux, 5 SAS disks in RAID5, 
16GB RAM)
- 2-node Hadoop cluster (0.20.2), untuned and unoptimized; data neither 
partitioned nor clustered; Hive tables stored in row-store format; HDFS 
replication factor: 2
- Sun JDK 1.6 (server mode JVM, JVM_HEAP_SIZE: 4GB RAM)
- Queries on TPC-H data (lineitem table: 70% of TPC-H data size, e.g. TPC-H 
30GB data: 21GB lineitem, ~180 million tuples)


These changes are being maintained at http://github.com/prafullat/hive

> Accelerate query execution using indexes
> ----------------------------------------
>
>                 Key: HIVE-1694
>                 URL: https://issues.apache.org/jira/browse/HIVE-1694
>             Project: Hive
>          Issue Type: New Feature
>          Components: Indexing, Query Processor
>    Affects Versions: 0.7.0
>            Reporter: Nikhil Deshpande
>
> The index building patch (HIVE-417) is checked into trunk; this JIRA issue 
> tracks supporting indexes in the Hive compiler & execution engine for SELECT 
> queries.
> This is in ref. to John's comment at
> https://issues.apache.org/jira/browse/HIVE-417?focusedCommentId=12884869&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12884869
> on creating a separate JIRA issue for tracking index usage in optimizer & query 
> execution.
> The aim of this effort is to use indexes to accelerate query execution (for 
> certain classes of queries). E.g.
> - Filters and range scans (already being worked on by He Yongqiang as part of 
> HIVE-417?)
> - Joins (index based joins)
> - Group By, Order By and other misc cases
> The proposal is multi-step:
> 1. Building index based operators, compiler and execution engine changes
> 2. Optimizer enhancements (e.g. cost-based optimizer to compare and choose 
> between index scans, full table scans etc.)
> This JIRA initially focuses on the first step. This JIRA is expected to hold 
> the information about index based plans & operator implementations for the 
> above-mentioned cases. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
