[jira] Updated: (HIVE-417) Implement Indexing in Hive

He Yongqiang (JIRA) Tue, 08 Jun 2010 14:49:37 -0700

     [ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


He Yongqiang updated HIVE-417:
------------------------------

    Attachment: hive-indexing.3.patch

With this patch, the index works, but it is not very intelligent yet.

This is how this patch works:

=== how to create the index table and generate index data ===
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

drop table src_rc_index;

//create an index table on table src_rc; the index column is key.
//The index table's data is stored as textfile (seq and rcfile also work).
create index src_rc_index type compact on table src_rc(key) stored as textfile; 

hive> show table extended like src_rc_index;
tableName:src_rc_index
owner:heyongqiang
location:file:/user/hive/warehouse/src_rc_index
inputformat:org.apache.hadoop.mapred.TextInputFormat
outputformat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
columns:struct columns { i32 key, string _bucketname, list<string> _offsets}

About the index table's schema: besides the index columns from the base table, 
the index table has two more columns (_bucketname string, _offsets 
array<string>).


//generate the actual index table's data (partitions are also supported)
update index src_rc_index;
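
To illustrate what the generated index data looks like, here is a minimal 
Python sketch of how rows matching the schema above (key, _bucketname, 
_offsets) could be built. It is a conceptual model only, not the patch's 
map-reduce job: the function name build_compact_index, the sample paths, and 
the use of integer offsets (the table stores strings) are assumptions made 
for illustration.

```python
from collections import defaultdict

def build_compact_index(records):
    """Group (key, bucketname, offset) triples into compact index rows.

    Each output row mirrors the index table schema:
    (key, _bucketname, _offsets).
    """
    grouped = defaultdict(set)
    for key, bucketname, offset in records:
        grouped[(key, bucketname)].add(offset)
    # Sort by (key, bucketname) so the result resembles a sorted index table.
    return [
        (key, bucketname, sorted(offsets))
        for (key, bucketname), offsets in sorted(grouped.items())
    ]

# Hypothetical base-table scan output: (key, file the row lives in, offset).
records = [
    (0, "/warehouse/src_rc/part-0", 0),
    (5, "/warehouse/src_rc/part-0", 0),
    (0, "/warehouse/src_rc/part-0", 4096),
    (0, "/warehouse/src_rc/part-1", 0),
]
index_rows = build_compact_index(records)
```

Each distinct (key, file) pair becomes one row, which is why the index is 
called compact: repeated keys in the same file share a single offset list.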

=== how to use the index table ===

//find the offsets for 'key=0' in the index table, and write the bucketname and 
//offset list to a temp directory
insert overwrite directory "/tmp/index_result" select `_bucketname` ,  
`_offsets` from src_rc_index where key=0;

set hive.exec.index_file=/tmp/index_result; 

//use a new index input format to prune input splits based on the offset list 
//stored in "hive.exec.index_file", which was populated by the previous command
set hive.input.format=org.apache.hadoop.hive.ql.index.io.HiveIndexInputFormat;

//this query will not scan the whole base data
select key, value from src_rc where key=0;


Things done in the patch:
1) An HQL command for creating the index table.
2) An HQL command and map-reduce job for updating the index (generating the 
index table's data). 
3) A HiveIndexInputFormat that uses the offsets obtained from the index table 
to reduce the number of blocks/map tasks.
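
The split pruning in item 3 can be pictured with a small Python sketch. This 
is not the HiveIndexInputFormat code; it assumes a split is a (path, start, 
length) triple and that the index result maps each file to a list of byte 
offsets. The function name prune_splits and the sample data are hypothetical.

```python
def prune_splits(splits, index_hits):
    """Keep only splits whose byte range contains an indexed offset.

    splits:     list of (path, start, length) triples
    index_hits: dict mapping path -> list of byte offsets from the index
    """
    kept = []
    for path, start, length in splits:
        offsets = index_hits.get(path, [])
        # A split survives if any hit falls inside [start, start + length).
        if any(start <= off < start + length for off in offsets):
            kept.append((path, start, length))
    return kept

# Hypothetical input: three splits over two files, one hit at offset 120.
splits = [
    ("/warehouse/src_rc/part-0", 0, 100),
    ("/warehouse/src_rc/part-0", 100, 100),
    ("/warehouse/src_rc/part-1", 0, 100),
]
index_hits = {"/warehouse/src_rc/part-0": [120]}
pruned = prune_splits(splits, index_hits)
```

Only the second split of part-0 remains, so a single map task would be 
launched instead of three.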

Things that still need to be done:
1) Right now the index table is specified manually in queries. We need this to 
be more intelligent, with the query plan automatically generated to use the 
index.
2) The HiveIndexInputFormat needs a new RecordReader that seeks to a given 
offset instead of scanning the whole block. 
3) Right now we use a map-reduce job to scan the whole index table to find the 
offsets of hits. But since the index table is sorted, we can leverage the sort 
order to avoid the map-reduce job in many cases. (The easiest way is to do a 
binary search in the client.)

The first todo is the most important part. I think the third may need much 
more work (though I may be wrong).
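
The client-side binary search suggested in the third todo could look roughly 
like the following Python sketch. It assumes the index rows are already 
available to the client as a list sorted by key, with each row shaped like 
(key, _bucketname, _offsets). The function names and sample rows are 
hypothetical, and reading the rows from the index table's files is left out.

```python
def lower_bound(rows, key):
    """Return the first position in sorted rows whose key is >= key."""
    lo, hi = 0, len(rows)
    while lo < hi:
        mid = (lo + hi) // 2
        if rows[mid][0] < key:
            lo = mid + 1
        else:
            hi = mid
    return lo

def lookup(rows, key):
    """Return the (_bucketname, _offsets) pairs for every row matching key."""
    i = lower_bound(rows, key)
    hits = []
    # Equal keys are adjacent in a sorted index, so scan forward from i.
    while i < len(rows) and rows[i][0] == key:
        hits.append(rows[i][1:])
        i += 1
    return hits

# Hypothetical sorted index rows: (key, _bucketname, _offsets).
rows = [
    (0, "part-0", [0, 4096]),
    (0, "part-1", [0]),
    (3, "part-0", [8192]),
    (5, "part-0", [0]),
]
hits_for_0 = lookup(rows, 0)
hits_for_4 = lookup(rows, 4)
```

A point lookup then touches O(log n) rows plus the matching ones, instead of 
scanning the whole index table with a map-reduce job.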

(Note: although this patch has been tested in a production cluster, it could 
still have bugs. We would really appreciate it if you reported any bugs you 
find here.)

> Implement Indexing in Hive
> --------------------------
>
>                 Key: HIVE-417
>                 URL: https://issues.apache.org/jira/browse/HIVE-417
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Metastore, Query Processor
>    Affects Versions: 0.3.0, 0.3.1, 0.4.0, 0.6.0
>            Reporter: Prasad Chakka
>            Assignee: He Yongqiang
>         Attachments: hive-417.proto.patch, hive-417－2009-07-18.patch, 
> hive-indexing.3.patch
>
>
> Implement indexing on Hive so that lookup and range queries are efficient.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
