[jira] [Commented] (HBASE-3529) Add search to HBase

Jason Rutherglen (JIRA) Thu, 14 Apr 2011 09:01:50 -0700

    [ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13019879#comment-13019879
 ]


Jason Rutherglen commented on HBASE-3529:
-----------------------------------------

Here are some basic benchmark numbers.  The code is more or less pushed to 
Github.  I still need to verify that it all works from a clean download of the 
three parts: Lucene, Hadoop 0.20 append modified with HDFS-347, and HBase with 
Search. 

The architecture is to write out a single block per Lucene file.  In this way 
we can simply obtain one underlying java.io.File directly from the DFSClient.  
The file is then MMap'ed using a modified version of the MMapDirectory called 
HDFSDirectory.
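
As a rough illustration of the mmap step only (this is not the HDFSDirectory 
code, and the class and method names are hypothetical), here is a minimal 
sketch that assumes we already hold a local java.io.File for the block and 
maps it read-only with the standard java.nio API:

```java
import java.io.File;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of Lucene, HDFS, or HBase.
final class MmapSketch {

    // Maps the whole file read-only. The mapping stays valid after the
    // channel is closed, so try-with-resources is safe here.
    static MappedByteBuffer mapReadOnly(File file) throws IOException {
        try (FileChannel channel =
                 FileChannel.open(file.toPath(), StandardOpenOption.READ)) {
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }
}
```

A single MappedByteBuffer cannot cover more than Integer.MAX_VALUE bytes, so 
a real directory implementation would have to map larger files in several 
chunks.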

The benchmark shows that storing Lucene indexes into HDFS and reading directly 
from HDFS is viable (as opposed to copying the files out of HDFS first to the 
local filesystem).

Here are times in milliseconds, on the Wiki-EN corpus:

lucene indexing duration: 50202
lucene query time #1: 11780
lucene query time #2: 6211
lucene query time #3: 6181

hbase indexing duration: 70681
hbase query time #1: 8332
hbase query time #2: 6785
hbase query time #3: 6621

As you can see, indexing is about 41% slower when writing to HDFS (70681 ms 
vs. 50202 ms).  However, with the new changes going into Lucene (i.e., 
LUCENE-2324), a pause when flushing (due to HDFS overhead) will not slow down 
indexing, so expect indexing parity soon.

The main query times to look at are #2 and #3, allowing for warmup of the 
system IO cache in #1.  HBase queries are somewhat slower in those rounds 
(roughly 7-9%) because each new DFSInputStream created must contact the 
DataNode.  We can optimize this; however, I think for now we're good.
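
To put the timings above side by side, here is a small sketch that computes 
the relative overhead of the HBase/HDFS run against plain Lucene.  The class 
and method names are made up for illustration; the inputs are the millisecond 
figures listed above.

```java
// Hypothetical helper for comparing the two sets of timings.
final class OverheadSketch {

    // Relative slowdown of the HBase run versus plain Lucene, in percent.
    // A negative value means the HBase run was faster.
    static double overheadPercent(long luceneMillis, long hbaseMillis) {
        return 100.0 * (hbaseMillis - luceneMillis) / luceneMillis;
    }

    public static void main(String[] args) {
        System.out.printf("indexing: %.1f%%%n", overheadPercent(50202, 70681));
        System.out.printf("query #1: %.1f%%%n", overheadPercent(11780, 8332));
        System.out.printf("query #2: %.1f%%%n", overheadPercent(6211, 6785));
        System.out.printf("query #3: %.1f%%%n", overheadPercent(6181, 6621));
    }
}
```

Round #1 comes out negative (the HBase run was faster there), which is one 
more reason to treat it as cache warmup rather than a meaningful comparison.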

Here are the queries being run (50 times per round); they are non-trivial.

"states"
"unit*"
"uni*"
"u*d"
"un*d"
"united~0.75"
"united~0.6"
"unit~0.7"
"unit~0.5", // 2
"doctitle:/.*[Uu]nited.*/"
"united OR states"
"united AND states"
"nebraska AND states"
"\"united states\""
"\"united states\"~3"

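For anyone wanting to reproduce the rounds, a timing loop along these lines 
would do.  This is only a sketch and not the benchmark code itself: the class 
name and the runQuery callback are hypothetical, and it assumes "50 times per 
round" means the whole query list is executed 50 times in each round.

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical timing harness; runQuery stands in for whatever executes
// one query string against the index.
final class QueryRoundsSketch {

    // Returns the wall-clock time of each round in milliseconds.
    static long[] timeRounds(List<String> queries, int rounds,
                             int repeatsPerRound, Consumer<String> runQuery) {
        long[] millis = new long[rounds];
        for (int r = 0; r < rounds; r++) {
            long start = System.nanoTime();
            for (int i = 0; i < repeatsPerRound; i++) {
                for (String query : queries) {
                    runQuery.accept(query);
                }
            }
            millis[r] = (System.nanoTime() - start) / 1_000_000L;
        }
        return millis;
    }
}
```

Calling timeRounds(queries, 3, 50, runQuery) gives one number per round, with 
the first round absorbing the cache warmup.
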
> Add search to HBase
> -------------------
>
>                 Key: HBASE-3529
>                 URL: https://issues.apache.org/jira/browse/HBASE-3529
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.0
>            Reporter: Jason Rutherglen
>         Attachments: HBASE-3529.patch, 
> lucene-analyzers-common-4.0-SNAPSHOT.jar, lucene-core-4.0-SNAPSHOT.jar, 
> lucene-misc-4.0-SNAPSHOT.jar
>
>
> Using the Apache Lucene library we can add freetext search to HBase.  The 
> advantages of this are:
> * HBase is highly scalable and distributed
> * HBase is realtime
> * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
> * Lucene offers many types of queries not currently available in HBase (eg, 
> AND, OR, NOT, phrase, etc)
> * It's easier to build scalable realtime systems on top of an already 
> architecturally sound, scalable realtime data system, eg, HBase.
> * Scaling realtime search will be as simple as scaling HBase.
> Phase 1 - Indexing:
> * Integrate Lucene into HBase such that an index mirrors a given region.  
> This means cascading adds, updates, and deletes between a Lucene index and an 
> HBase region (and vice versa).
> * Define meta-data to mark a region as indexed, and use a Solr schema to 
> allow the user to define the fields and analyzers.
> * Integrate with the HLog to ensure that index recovery can occur properly 
> (eg, on region server failure)
> * Mirror region splits with indexes (use Lucene's IndexSplitter?)
> * When a region is written to HDFS, also write the corresponding Lucene index 
> to HDFS.
> * A row key will be the ID of a given Lucene document.  The Lucene docstore 
> will explicitly not be used because the document/row data is stored in HBase. 
> We will need to determine the best data structure for efficiently mapping a 
> docid -> row key.  It could be a docstore, field cache, column stride 
> fields, or some other mechanism.
> * Write unit tests for the above
> Phase 2 - Queries:
> * Enable distributed Lucene queries
> * Regions that have Lucene indexes are inherently available and may be 
> searched on, meaning there's no need for a separate search related system in 
> Zookeeper.
> * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
