[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016088#comment-13016088
]
Jason Rutherglen commented on HBASE-3529:
-----------------------------------------
@Otis We can benchmark using Lucene in conjunction with HDFS-347. I have a
more streamlined version of that patch, which will be available on GitHub.
Implementing Solr for benchmarking would create too much overhead.
I think we may want to integrate with Solr [in the future] for out-of-the-box
distributed queries and facets, and also to make use of the schema. I'll likely
open additional Solr-related issues when we get there.
> Add search to HBase
> -------------------
>
> Key: HBASE-3529
> URL: https://issues.apache.org/jira/browse/HBASE-3529
> Project: HBase
> Issue Type: Improvement
> Affects Versions: 0.90.0
> Reporter: Jason Rutherglen
> Attachments: HBASE-3529.patch,
> lucene-analyzers-common-4.0-SNAPSHOT.jar, lucene-core-4.0-SNAPSHOT.jar,
> lucene-misc-4.0-SNAPSHOT.jar
>
>
> Using the Apache Lucene library we can add freetext search to HBase. The
> advantages of this are:
> * HBase is highly scalable and distributed
> * HBase is realtime
> * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
> * Lucene offers many types of queries not currently available in HBase (eg,
> AND, OR, NOT, phrase, etc)
> * It's easier to build scalable realtime systems on top of an already
> architecturally sound, scalable realtime data system, eg, HBase.
> * Scaling realtime search will be as simple as scaling HBase.
> Phase 1 - Indexing:
> * Integrate Lucene into HBase such that an index mirrors a given region.
> This means cascading adds, updates, and deletes between a Lucene index and an
> HBase region (and vice versa).
> * Define meta-data to mark a region as indexed, and use a Solr schema to
> allow the user to define the fields and analyzers.
> * Integrate with the HLog to ensure that index recovery can occur properly
> (eg, on region server failure)
> * Mirror region splits with indexes (use Lucene's IndexSplitter?)
> * When a region is written to HDFS, also write the corresponding Lucene index
> to HDFS.
> * A row key will be the ID of a given Lucene document. The Lucene docstore
> will explicitly not be used because the document/row data is stored in HBase.
> We will need to determine the best data structure for efficiently mapping a
> docid -> row key. It could be a docstore, field cache, column stride
> fields, or some other mechanism.
> * Write unit tests for the above
> Phase 2 - Queries:
> * Enable distributed Lucene queries
> * Regions that have Lucene indexes are inherently available and may be
> searched on, meaning there's no need for a separate search related system in
> Zookeeper.
> * Integrate search with HBase's RPC mechanism
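
To make the two phases above concrete, here is a minimal, library-free sketch in Python. It is not the attached patch and uses no Lucene or HBase APIs: `RegionIndex`, `search_and`, and `search_all` are hypothetical names, the "index" is a plain in-memory dictionary, and the docid -> row key mapping is simply a list. It only illustrates the idea that one index mirrors one region (puts and deletes cascade into it), and that a query is scattered to every region's index and the resulting row keys are merged.

```python
from collections import defaultdict


class RegionIndex:
    """Toy inverted index mirroring one region's rows (hypothetical, in-memory)."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of docids
        self.row_keys = []                # docid -> row key (None once deleted)
        self.live = {}                    # row key -> current docid

    def put(self, row_key, text):
        # Model an update as delete-then-add, so a row has at most one live doc.
        self.delete(row_key)
        docid = len(self.row_keys)
        self.row_keys.append(row_key)
        self.live[row_key] = docid
        for term in text.lower().split():
            self.postings[term].add(docid)

    def delete(self, row_key):
        docid = self.live.pop(row_key, None)
        if docid is not None:
            # Tombstone the docid; stale postings are filtered at query time.
            self.row_keys[docid] = None

    def search_and(self, *terms):
        # Intersect posting lists, then map surviving docids back to row keys.
        sets = [self.postings.get(t.lower(), set()) for t in terms]
        docids = set.intersection(*sets) if sets else set()
        return sorted(
            self.row_keys[d] for d in docids if self.row_keys[d] is not None
        )


def search_all(regions, *terms):
    # Scatter the query to every region's index and merge the row keys.
    hits = []
    for region in regions:
        hits.extend(region.search_and(*terms))
    return sorted(hits)


if __name__ == "__main__":
    r1, r2 = RegionIndex(), RegionIndex()
    r1.put("row-a", "HBase is realtime")
    r1.put("row-b", "Lucene is fast")
    r2.put("row-x", "HBase is scalable")
    r1.put("row-a", "HBase is distributed")  # update cascades as delete + add
    r1.delete("row-b")
    print(search_all([r1, r2], "hbase", "is"))
```

In this sketch the search returns only row keys, matching the description's point that the document data itself stays in HBase and would be fetched from there by key. A real implementation would also have to handle the HLog recovery and region-split items from Phase 1, which are out of scope here.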
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira