[jira] Commented: (HBASE-3529) Add search to HBase

Jason Rutherglen (JIRA) Sat, 26 Feb 2011 06:44:28 -0800

    [ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12999789#comment-12999789
 ]


Jason Rutherglen commented on HBASE-3529:
-----------------------------------------

It looks simple to change HDFS-347 (the HDFS-347-branch-20-append.txt patch) to 
read using positional reads. I assume that's necessary, as a block reader is 
instantiated per DFSInputStream? read(long position, byte[] buffer, int offset, 
int length) calls getBlockRange, which is synchronized. The read method then 
calls fetchBlockByteRange, which calls BlockReader.newBlockReader, i.e., the 
block reader is per thread and isn't reused? So the contention would be in 
getBlockRange? Perhaps there's no issue, or not much of one, if the 
HDFS-347-branch-20-append.txt patch (or something like it) is applied (using 
HADOOP-6311)?
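
To illustrate why positional reads help with concurrency, here is a minimal sketch. It is not HDFS code: it assumes a local file and uses java.nio's FileChannel.read(ByteBuffer, long) as a stand-in for a read(long position, byte[] buffer, int offset, int length) style call. The class and method names (PositionalReadSketch, pread, concurrentReadsAgree) are hypothetical. The point is only that each reader supplies its own offset, so several threads can share one open handle without seeking it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PositionalReadSketch {
    // Hypothetical helper with the same shape as
    // read(long position, byte[] buffer, int offset, int length).
    // It reads at an absolute offset and never moves the channel's own position.
    static int pread(FileChannel ch, long position, byte[] buffer, int offset, int length) {
        try {
            ByteBuffer dst = ByteBuffer.wrap(buffer, offset, length);
            int total = 0;
            while (dst.hasRemaining()) {
                int n = ch.read(dst, position + total);
                if (n < 0) break; // end of file
                total += n;
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes a temp file where byte i == (byte) i, then lets `threads` workers
    // read disjoint slices through ONE shared channel. Returns true if every
    // worker saw the bytes it expected.
    static boolean concurrentReadsAgree(int threads, int sliceLen) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Path file = null;
        try {
            file = Files.createTempFile("pread", ".bin");
            byte[] data = new byte[threads * sliceLen];
            for (int i = 0; i < data.length; i++) data[i] = (byte) i;
            Files.write(file, data);
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                List<Future<Boolean>> results = new ArrayList<>();
                for (int t = 0; t < threads; t++) {
                    final long start = (long) t * sliceLen;
                    results.add(pool.submit(() -> {
                        byte[] buf = new byte[sliceLen];
                        if (pread(ch, start, buf, 0, sliceLen) != sliceLen) return false;
                        for (int i = 0; i < sliceLen; i++) {
                            if (buf[i] != (byte) (start + i)) return false;
                        }
                        return true;
                    }));
                }
                boolean ok = true;
                for (Future<Boolean> f : results) ok &= f.get();
                return ok;
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
            if (file != null) {
                try { Files.deleteIfExists(file); } catch (IOException ignored) { }
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(concurrentReadsAgree(4, 1000));
    }
}
```

Whether the real DFSInputStream path behaves this well depends on the locking inside getBlockRange and on the cost of creating a block reader per call, which this sketch does not model.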

I guess the way forward is to write a Lucene Directory that uses HDFS 
underneath and gains concurrency by using DFSInputStream.read(long position, 
...)?  Oh, the other issue would be all the overhead from simply loading a 
byte[1024] (e.g., all the new object creation).  Hmm... That'll be a problem.  
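
One possible way to limit that overhead is to give each reader a single buffer that is allocated once and refilled in place by positional reads. The sketch below is only an illustration under stated assumptions: it reads a local file through java.nio's FileChannel rather than HDFS, and BufferedPositionalInput, readByte, seek and readAll are hypothetical names, not Lucene or Hadoop API. It does not address any per-call allocation inside the HDFS client itself.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class BufferedPositionalInput {
    private final FileChannel channel; // shared handle, never repositioned
    private final byte[] buffer;       // allocated once, reused on every refill
    private long bufferStart = 0;      // file offset of buffer[0]
    private int bufferLength = 0;      // number of valid bytes in buffer
    private int bufferPos = 0;         // index of the next byte to return
    int refills = 0;                   // counts positional reads, for illustration

    BufferedPositionalInput(FileChannel channel, int bufferSize) {
        this.channel = channel;
        this.buffer = new byte[bufferSize];
    }

    // Returns the next byte as 0..255, or -1 at end of file.
    int readByte() {
        if (bufferPos >= bufferLength && !refill()) return -1;
        return buffer[bufferPos++] & 0xFF;
    }

    // Moves the logical position; stays inside the current buffer when possible.
    void seek(long pos) {
        if (pos >= bufferStart && pos < bufferStart + bufferLength) {
            bufferPos = (int) (pos - bufferStart);
        } else {
            bufferStart = pos;
            bufferLength = 0;
            bufferPos = 0;
        }
    }

    // Refills the existing buffer with one positional read; no new byte[] is created.
    private boolean refill() {
        long start = bufferStart + bufferPos;
        try {
            int n = channel.read(ByteBuffer.wrap(buffer), start);
            if (n <= 0) return false; // treated as end of file in this sketch
            bufferStart = start;
            bufferLength = n;
            bufferPos = 0;
            refills++;
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper: writes fileSize bytes (byte i == (byte) i), reads them back,
    // and returns {bytes read, mismatched bytes, refills performed}.
    static long[] readAll(int fileSize, int bufferSize) {
        Path file = null;
        try {
            file = Files.createTempFile("buffered", ".bin");
            byte[] data = new byte[fileSize];
            for (int i = 0; i < fileSize; i++) data[i] = (byte) i;
            Files.write(file, data);
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                BufferedPositionalInput in = new BufferedPositionalInput(ch, bufferSize);
                long count = 0, mismatches = 0;
                for (int b; (b = in.readByte()) >= 0; count++) {
                    if (b != (count & 0xFF)) mismatches++;
                }
                return new long[] {count, mismatches, in.refills};
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (file != null) {
                try { Files.deleteIfExists(file); } catch (IOException ignored) { }
            }
        }
    }
}
```

Each reader instance keeps its own position and buffer, so several of them can share one channel. The ByteBuffer.wrap call still creates a small wrapper object per refill; holding a ByteBuffer field instead of a byte[] would remove that as well.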

> Add search to HBase
> -------------------
>
>                 Key: HBASE-3529
>                 URL: https://issues.apache.org/jira/browse/HBASE-3529
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.0
>            Reporter: Jason Rutherglen
>
> Using the Apache Lucene library we can add freetext search to HBase.  The 
> advantages of this are:
> * HBase is highly scalable and distributed
> * HBase is realtime
> * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
> * Lucene offers many types of queries not currently available in HBase (e.g., 
> AND, OR, NOT, phrase)
> * It's easier to build scalable realtime systems on top of an already 
> architecturally sound, scalable realtime data system, e.g., HBase.
> * Scaling realtime search will be as simple as scaling HBase.
> Phase 1 - Indexing:
> * Integrate Lucene into HBase such that an index mirrors a given region.  
> This means cascading adds, updates, and deletes between a Lucene index and an 
> HBase region (and vice versa).
> * Define meta-data to mark a region as indexed, and use a Solr schema to 
> allow the user to define the fields and analyzers.
> * Integrate with the HLog to ensure that index recovery can occur properly 
> (e.g., on region server failure)
> * Mirror region splits with indexes (use Lucene's IndexSplitter?)
> * When a region is written to HDFS, also write the corresponding Lucene index 
> to HDFS.
> * A row key will be the ID of a given Lucene document.  The Lucene docstore 
> will explicitly not be used because the document/row data is stored in HBase. 
>  We will need to determine the best data structure for efficiently mapping a 
> docid -> row key.  It could be a docstore, field cache, column stride 
> fields, or some other mechanism.
> * Write unit tests for the above
> Phase 2 - Queries:
> * Enable distributed Lucene queries
> * Regions that have Lucene indexes are inherently available and may be 
> searched on, meaning there's no need for a separate search related system in 
> Zookeeper.
> * Integrate search with HBase's RPC mechanism

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
