Add search to HBase
-------------------
Key: HBASE-3529
URL: https://issues.apache.org/jira/browse/HBASE-3529
Project: HBase
Issue Type: Improvement
Affects Versions: 0.90.0
Reporter: Jason Rutherglen
Using the Apache Lucene library, we can add free-text search to HBase. The
advantages of this are:
* HBase is highly scalable and distributed
* HBase is realtime
* Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
* Lucene offers many types of queries not currently available in HBase (e.g.,
AND, OR, NOT, and phrase queries)
* It's easier to build scalable realtime systems on top of an already
architecturally sound, scalable realtime data system, e.g., HBase.
* Scaling realtime search will be as simple as scaling HBase.
Phase 1 - Indexing:
* Integrate Lucene into HBase such that an index mirrors a given region. This
means cascading adds, updates, and deletes between a Lucene index and an HBase
region (and vice versa).
* Define meta-data to mark a region as indexed, and use a Solr schema to allow
the user to define the fields and analyzers.
* Integrate with the HLog to ensure that index recovery can occur properly
(e.g., on region server failure).
* Mirror region splits with indexes (use Lucene's IndexSplitter?)
* When a region is written to HDFS, also write the corresponding Lucene index
to HDFS.
* A row key will be the ID of a given Lucene document. The Lucene docstore
will explicitly not be used to hold document contents, because the
document/row data is stored in HBase. We will need to determine the best data
structure for efficiently mapping a docid -> row key. It could be a docstore,
field cache, column stride fields, or some other mechanism.
* Write unit tests for the above
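
To make the mirroring and the docid -> row key mapping concrete, here is a
minimal sketch in plain Java. It is a toy in-memory inverted index, not Lucene
or HBase code. The class name RegionIndexMirror and its methods are
hypothetical and exist only for illustration. It assumes an update is handled
as delete-then-add, and it keeps the docid -> row key mapping in a simple list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: a toy index that mirrors puts and deletes on a region.
// It does not use any Lucene or HBase API.
public class RegionIndexMirror {
    // docid -> row key; a null entry marks a deleted document.
    private final List<String> docIdToRowKey = new ArrayList<>();
    // row key -> current docid, so an update can retire the old document.
    private final Map<String, Integer> rowKeyToDocId = new HashMap<>();
    // term -> docids containing that term (the inverted index).
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Mirror a put: an update is modeled as delete-then-add.
    public void put(String rowKey, String text) {
        delete(rowKey);
        int docId = docIdToRowKey.size();
        docIdToRowKey.add(rowKey);
        rowKeyToDocId.put(rowKey, docId);
        for (String term : text.toLowerCase().split("\\s+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Mirror a delete: mark the document as deleted; postings are left in
    // place and filtered at search time.
    public void delete(String rowKey) {
        Integer docId = rowKeyToDocId.remove(rowKey);
        if (docId != null) {
            docIdToRowKey.set(docId, null);
        }
    }

    // Return the row keys of live documents containing the term, in docid order.
    public List<String> search(String term) {
        List<String> rowKeys = new ArrayList<>();
        Set<Integer> docIds =
            postings.getOrDefault(term.toLowerCase(), Collections.<Integer>emptySet());
        for (int docId : docIds) {
            String rowKey = docIdToRowKey.get(docId);
            if (rowKey != null) {
                rowKeys.add(rowKey);
            }
        }
        return rowKeys;
    }
}
```

A search returns row keys only; the row data itself would still be fetched
from HBase, which is why the docstore is not needed for document contents.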
Phase 2 - Queries:
* Enable distributed Lucene queries
* Regions that have Lucene indexes are inherently available and may be searched
directly, meaning there's no need for a separate search-related system in
ZooKeeper.
* Integrate search with HBase's RPC mechanism
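
One way to picture a distributed query is scatter-gather: each indexed region
returns its own scored hits, and the caller merges them into a global top-k.
The following is a minimal sketch of the merge step only, in plain Java. The
names TopKMerge and Hit are hypothetical, and the sketch assumes scores from
different regions are directly comparable, which a real implementation would
need to ensure.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of the gather step of a scatter-gather search.
// It does not use any Lucene or HBase API.
public class TopKMerge {
    // One search hit: the HBase row key and its relevance score.
    public static final class Hit {
        public final String rowKey;
        public final float score;

        public Hit(String rowKey, float score) {
            this.rowKey = rowKey;
            this.score = score;
        }
    }

    // Merge per-region results and keep the k highest-scoring hits, best first.
    public static List<Hit> merge(List<List<Hit>> perRegion, int k) {
        Comparator<Hit> byScore = Comparator.comparingDouble((Hit h) -> h.score);
        // Min-heap on score: the weakest retained hit sits at the head.
        PriorityQueue<Hit> heap = new PriorityQueue<>(byScore);
        for (List<Hit> hits : perRegion) {
            for (Hit h : hits) {
                heap.add(h);
                if (heap.size() > k) {
                    heap.poll(); // drop the current weakest hit
                }
            }
        }
        List<Hit> out = new ArrayList<>(heap);
        out.sort(byScore.reversed());
        return out;
    }
}
```

Because each region only needs to return its own top k hits, the merge cost
grows with the number of regions queried rather than with the index size.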