[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
He Yongqiang updated HIVE-417:
------------------------------

    Attachment: hive-indexing.3.patch

With this patch the index works, but it is not yet very intelligent. This is how the patch works:

=== How to create the index table and generate index data ===

set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
drop table src_rc_index;

//create an index table on table src_rc; the index column is key.
//The index table's data is stored as textfile (seq and rcfile also work)
create index src_rc_index type compact on table src_rc(key) stored as textfile;

hive> show table extended like src_rc_index;
tableName:src_rc_index
owner:heyongqiang
location:file:/user/hive/warehouse/src_rc_index
inputformat:org.apache.hadoop.mapred.TextInputFormat
outputformat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
columns:struct columns { i32 key, string _bucketname, list<string> _offsets}

About the index table's schema: besides the index columns from the base table, the index table has two more columns (_bucketname string, _offsets array<string>).

//generate the actual index table's data (partitions are also supported)
update index src_rc_index;

=== How to use the index table ===

//find the offsets for 'key=0' in the index table, and put the bucket name and offset list in a temp directory
insert overwrite directory "/tmp/index_result" select `_bucketname` , `_offsets` from src_rc_index where key=0;
set hive.exec.index_file=/tmp/index_result;

//use a new index input format to prune input splits based on the offset list
//stored in "hive.exec.index_file", which was populated by the previous command
set hive.input.format=org.apache.hadoop.hive.ql.index.io.HiveIndexInputFormat;

//this query will not scan the whole base data
select key, value from src_rc where key=0;

Things done in the patch:
1) hql command for creating the index table
2) hql command and map-reduce job for updating the index (generating the index table's data)
3) a HiveIndexInputFormat that uses the offsets obtained from the index table to reduce the number of blocks/map tasks

Things that still need to be done:
1) Right now the index table is specified manually in queries. This needs to be more intelligent: the plan that uses the index should be generated automatically.
2) The HiveIndexInputFormat needs a new RecordReader that seeks to a given offset instead of scanning the whole block.
3) Right now we use a map-reduce job to scan the whole index table to find the matching offsets. But since the index table is sorted, we can leverage the sort order to avoid the map-reduce job in many cases. (The easiest way is to do a binary search in the client.)

The first to-do is the most important part. I think the third may need much more work (though I may be wrong).

(Note: although this patch has been tested in a production cluster, it could still have bugs. We would really appreciate it if you report any bugs you find here.)

> Implement Indexing in Hive
> --------------------------
>
>                 Key: HIVE-417
>                 URL: https://issues.apache.org/jira/browse/HIVE-417
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Metastore, Query Processor
>    Affects Versions: 0.3.0, 0.3.1, 0.4.0, 0.6.0
>            Reporter: Prasad Chakka
>            Assignee: He Yongqiang
>         Attachments: hive-417.proto.patch, hive-417-2009-07-18.patch, hive-indexing.3.patch
>
>
> Implement indexing on Hive so that lookup and range queries are efficient.
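
To illustrate the third to-do item (a client-side binary search over the sorted index table instead of a map-reduce scan), here is a rough sketch in Python. It is not code from the patch. It assumes the index rows are already available in memory, sorted by key, as (key, _bucketname, _offsets) tuples; the base-table file paths and the lookup() helper are hypothetical names used only for this example.

```python
from bisect import bisect_left, bisect_right

# Hypothetical in-memory copy of a sorted index table. Each row mirrors the
# index schema (key, _bucketname, _offsets); the paths are made up.
INDEX_ROWS = [
    (0, "file:/warehouse/src_rc/part-0", ["0", "2048"]),
    (0, "file:/warehouse/src_rc/part-1", ["512"]),
    (5, "file:/warehouse/src_rc/part-0", ["4096"]),
    (9, "file:/warehouse/src_rc/part-1", ["1024", "8192"]),
]

# Extract the key column once so that each lookup is O(log n).
INDEX_KEYS = [row[0] for row in INDEX_ROWS]


def lookup(key):
    """Return (bucketname, offsets) for every index row whose key matches."""
    lo = bisect_left(INDEX_KEYS, key)   # first row with row key >= key
    hi = bisect_right(INDEX_KEYS, key)  # first row with row key > key
    return [(bucket, offsets) for _, bucket, offsets in INDEX_ROWS[lo:hi]]


if __name__ == "__main__":
    for bucket, offsets in lookup(0):
        print(bucket, offsets)
```

A real implementation would have to search the index files on HDFS rather than a Python list, which is probably where most of the extra work mentioned above would go.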