[
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13046258#comment-13046258
]
Otis Gospodnetic commented on HBASE-3529:
-----------------------------------------
A few more comments/questions for Jason:
* I see PKIndexSplitter usage for splitting the index when a region splits. I
see you split the index, open 2 IndexWriters for 2 new Lucene indices, but then
either you are not adding documents to them, or I'm not seeing it?
* Are there issues around distributed search? It looks like it wasn't in your
github branch.
* What will happen when a region changes its location/regionserver for whatever
reason? I see HDFS-2004 got -1ed and you said without that search will be
slow. Do you have an alternative plan?
* What is the reason for storing those 2 extra row fields? (the UID one and the
other one... I think it's called rowStr or something like that)
* What about storing the index in HBase itself? (a la Solandra, I suppose)
Would this be doable? Would it make things simpler in the sense that any
splitting or moving around, etc. may be handled by HBase and we wouldn't have
to make sure the Lucene index always mirrors what's in a region and make sure
it follows the region wherever it goes? Lars' idea/question, and I hope I
didn't misunderstand or misrepresent his ideas.
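To make the split question above concrete, here is a minimal sketch of what I'd expect a region split to do to the index contents. It is not the patch's code and does not use Lucene or PKIndexSplitter at all: the class and method names are made up for illustration, and the "index" is just a list of row keys. It assumes the split key becomes the first row of the upper daughter, and that row keys compare as unsigned bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: models an index as a list of row keys, not a Lucene index.
final class RegionSplitSketch {

    // Compare row keys byte by byte, treating each byte as unsigned.
    static int compareRowKeys(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Partition the parent's row keys around the split key.
    // result.get(0) holds keys < splitKey, result.get(1) holds keys >= splitKey.
    static List<List<byte[]>> split(List<byte[]> parentRowKeys, byte[] splitKey) {
        List<byte[]> lower = new ArrayList<>();
        List<byte[]> upper = new ArrayList<>();
        for (byte[] rowKey : parentRowKeys) {
            if (compareRowKeys(rowKey, splitKey) < 0) {
                lower.add(rowKey);
            } else {
                upper.add(rowKey);
            }
        }
        List<List<byte[]>> daughters = new ArrayList<>();
        daughters.add(lower);
        daughters.add(upper);
        return daughters;
    }

    // Hypothetical helper: encode a string row key as UTF-8 bytes.
    static byte[] key(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        List<byte[]> parent = new ArrayList<>();
        parent.add(key("apple"));
        parent.add(key("mango"));
        parent.add(key("zebra"));
        List<List<byte[]>> daughters = split(parent, key("mango"));
        System.out.println("lower=" + daughters.get(0).size()
                + " upper=" + daughters.get(1).size());
    }
}
```

In a real implementation each daughter list would correspond to documents actually written into one of the 2 new indices, which is the step I could not find in the branch.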
> Add search to HBase
> -------------------
>
> Key: HBASE-3529
> URL: https://issues.apache.org/jira/browse/HBASE-3529
> Project: HBase
> Issue Type: Improvement
> Affects Versions: 0.90.0
> Reporter: Jason Rutherglen
> Attachments: HBASE-3529.patch
>
>
> Using the Apache Lucene library we can add freetext search to HBase. The
> advantages of this are:
> * HBase is highly scalable and distributed
> * HBase is realtime
> * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
> * Lucene offers many types of queries not currently available in HBase (eg,
> AND, OR, NOT, phrase, etc)
> * It's easier to build scalable realtime systems on top of an already
> architecturally sound, scalable realtime data system, eg, HBase.
> * Scaling realtime search will be as simple as scaling HBase.
> Phase 1 - Indexing:
> * Integrate Lucene into HBase such that an index mirrors a given region.
> This means cascading adds, updates, and deletes between a Lucene index and an
> HBase region (and vice versa).
> * Define meta-data to mark a region as indexed, and use a Solr schema to
> allow the user to define the fields and analyzers.
> * Integrate with the HLog to ensure that index recovery can occur properly
> (eg, on region server failure)
> * Mirror region splits with indexes (use Lucene's IndexSplitter?)
> * When a region is written to HDFS, also write the corresponding Lucene index
> to HDFS.
> * A row key will be the ID of a given Lucene document. The Lucene docstore
> will explicitly not be used because the document/row data is stored in HBase.
> We will need to determine the best data structure for efficiently mapping a
> docid -> row key. It could be a docstore, field cache, column stride
> fields, or some other mechanism.
> * Write unit tests for the above
> Phase 2 - Queries:
> * Enable distributed Lucene queries
> * Regions that have Lucene indexes are inherently available and may be
> searched on, meaning there's no need for a separate search related system in
> Zookeeper.
> * Integrate search with HBase's RPC mechanism
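As a rough illustration of the docid -> row key mapping raised in Phase 1, here is a minimal sketch of the field-cache-style option: a dense in-memory table indexed by docid. It is not Lucene's FieldCache API and not what the patch does; the class and method names are hypothetical, and it assumes docids are assigned densely from 0 in insertion order. A real version would also have to be rebuilt when docids shift, for example after segment merges.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a dense in-memory table, not Lucene's FieldCache API.
final class DocIdToRowKeySketch {

    // Position i holds the row key of the document with docid i.
    private final List<byte[]> rowKeysByDocId = new ArrayList<>();

    // Record a row key and return the docid assigned to it (0, 1, 2, ...).
    int add(byte[] rowKey) {
        rowKeysByDocId.add(rowKey.clone());
        return rowKeysByDocId.size() - 1;
    }

    // Constant-time lookup from docid to a copy of the row key.
    byte[] rowKeyFor(int docId) {
        if (docId < 0 || docId >= rowKeysByDocId.size()) {
            throw new IllegalArgumentException("unknown docid: " + docId);
        }
        return rowKeysByDocId.get(docId).clone();
    }

    public static void main(String[] args) {
        DocIdToRowKeySketch table = new DocIdToRowKeySketch();
        int first = table.add("row-0001".getBytes(StandardCharsets.UTF_8));
        int second = table.add("row-0002".getBytes(StandardCharsets.UTF_8));
        // A search hit carries only a docid; the row key is resolved here
        // and would then be used to fetch the row from HBase.
        System.out.println(first + " -> "
                + new String(table.rowKeyFor(first), StandardCharsets.UTF_8));
        System.out.println(second + " -> "
                + new String(table.rowKeyFor(second), StandardCharsets.UTF_8));
    }
}
```

The trade-off with this option is memory: every row key in the region stays resident, whereas a docstore lookup would trade that memory for a disk read per hit.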
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira