[jira] [Updated] (HBASE-3529) Add search to HBase

liusheding (JIRA) Thu, 13 Sep 2012 00:09:11 -0700

     [ 
https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


liusheding updated HBASE-3529:
------------------------------

    Description: 
Using the Apache Lucene library we can add freetext search to HBase.  The 
advantages of this are:

* HBase is highly scalable and distributed
* HBase is realtime
* Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
* Lucene offers many types of queries not currently available in HBase (e.g., 
AND, OR, NOT, phrase, etc.)
* It's easier to build scalable realtime systems on top of an already 
architecturally sound, scalable realtime data system, e.g., HBase.
* Scaling realtime search will be as simple as scaling HBase.

Phase 1 - Indexing:

* Integrate Lucene into HBase such that an index mirrors a given region.  This 
means cascading adds, updates, and deletes between a Lucene index and an HBase 
region (and vice versa).
* Define metadata to mark a region as indexed, and use a Solr schema to allow 
the user to define the fields and analyzers.
* Integrate with the HLog to ensure that index recovery can occur properly (e.g., 
on region server failure)
* Mirror region splits with indexes (use Lucene's IndexSplitter?)
* When a region is written to HDFS, also write the corresponding Lucene index 
to HDFS.
* A row key will be the ID of a given Lucene document.  The Lucene docstore 
will explicitly not be used because the document/row data is stored in HBase.  
We will need to determine the best data structure for efficiently mapping a 
docid -> row key.  It could be a docstore, field cache, column stride 
fields, or some other mechanism.
* Write unit tests for the above
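
The mirroring and docid -> row key ideas in Phase 1 can be illustrated with a 
small toy model. This is only a sketch under stated assumptions: the class 
MirroredRegionIndex and all of its methods are hypothetical, it uses plain 
Python containers rather than the HBase or Lucene APIs, and it picks a simple 
array as the docid -> row key mapping purely for illustration.

```python
from collections import defaultdict


class MirroredRegionIndex:
    """Hypothetical toy model: one region plus its mirrored inverted index."""

    def __init__(self):
        self.rows = {}                      # row key -> text (the "region" data)
        self.postings = defaultdict(set)    # term -> set of docids
        self.docid_to_row = []              # docid -> row key (None once deleted)
        self.row_to_docid = {}              # row key -> live docid

    def put(self, row_key, text):
        # An update is modelled as delete-then-add, so every write gets a
        # fresh docid and the old docid is retired.
        if row_key in self.row_to_docid:
            self.delete(row_key)
        docid = len(self.docid_to_row)
        self.docid_to_row.append(row_key)
        self.row_to_docid[row_key] = docid
        self.rows[row_key] = text
        for term in text.lower().split():
            self.postings[term].add(docid)

    def delete(self, row_key):
        # Cascade the region delete into the index.
        docid = self.row_to_docid.pop(row_key, None)
        if docid is None:
            return
        for term in self.rows.pop(row_key).lower().split():
            self.postings[term].discard(docid)
        self.docid_to_row[docid] = None

    def search(self, term):
        # Only row keys are returned; the row data stays in the region.
        docids = sorted(self.postings.get(term.lower(), ()))
        return [self.docid_to_row[d] for d in docids]
```

In this sketch the index never stores the row contents for retrieval, which 
matches the intent of not using the docstore: a search yields row keys, and 
the caller would then read the rows from the region itself.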

Phase 2 - Queries:

* Enable distributed Lucene queries
* Regions that have Lucene indexes are inherently available and may be searched 
on, meaning there's no need for a separate search-related system in ZooKeeper.
* Integrate search with HBase's RPC mechanism
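
One way to picture the distributed queries of Phase 2 is a scatter-gather 
over regions: each region answers from its own index, and the partial results 
are merged. The sketch below is an assumption-laden toy, not the proposed 
design: search_region and search_all_regions are hypothetical names, regions 
are modelled as plain dictionaries, scoring is a bare term count, and no RPC 
is involved.

```python
import heapq


def search_region(region_docs, term, k):
    """Score one region's rows by term frequency and return its local top k."""
    hits = []
    for row_key, text in region_docs.items():
        score = text.lower().split().count(term.lower())
        if score > 0:
            hits.append((score, row_key))
    return heapq.nlargest(k, hits)


def search_all_regions(regions, term, k):
    """Scatter the query to every region, then merge the partial results."""
    partial = []
    for region_docs in regions:
        partial.extend(search_region(region_docs, term, k))
    # Highest score first; ties broken by row key so the order is deterministic.
    partial.sort(key=lambda hit: (-hit[0], hit[1]))
    return [row_key for _, row_key in partial[:k]]
```

Because each region returns at most k hits, the merge step handles at most k 
entries per region regardless of how many rows a region holds.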



  was:
Using the Apache Lucene library we can add freetext search to HBase.  The 
advantages of this are:

* HBase is highly scalable and distributed
* HBase is realtime
* Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
* Lucene offers many types of queries not currently available in HBase (e.g., 
AND, OR, NOT, phrase, etc.)
* It's easier to build scalable realtime systems on top of an already 
architecturally sound, scalable realtime data system, e.g., HBase.
* Scaling realtime search will be as simple as scaling HBase.

Phase 1 - Indexing:

* Integrate Lucene into HBase such that an index mirrors a given region.  This 
means cascading adds, updates, and deletes between a Lucene index and an HBase 
region (and vice versa).
* Define metadata to mark a region as indexed, and use a Solr schema to allow 
the user to define the fields and analyzers.
* Integrate with the HLog to ensure that index recovery can occur properly (e.g., 
on region server failure)
* Mirror region splits with indexes (use Lucene's IndexSplitter?)
* When a region is written to HDFS, also write the corresponding Lucene index 
to HDFS.
* A row key will be the ID of a given Lucene document.  The Lucene docstore 
will explicitly not be used because the document/row data is stored in HBase.  
We will need to determine the best data structure for efficiently mapping a 
docid -> row key.  It could be a docstore, field cache, column stride 
fields, or some other mechanism.
* Write unit tests for the above

Phase 2 - Queries:

* Enable distributed Lucene queries
* Regions that have Lucene indexes are inherently available and may be searched 
on, meaning there's no need for a separate search-related system in ZooKeeper.
* Integrate search with HBase's RPC mechanism



    
> Add search to HBase
> -------------------
>
>                 Key: HBASE-3529
>                 URL: https://issues.apache.org/jira/browse/HBASE-3529
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.90.0
>            Reporter: Jason Rutherglen
>         Attachments: HBASE-3529.patch, HDFS-APPEND-0.20-LOCAL-FILE.patch
>
>
> Using the Apache Lucene library we can add freetext search to HBase.  The 
> advantages of this are:
> * HBase is highly scalable and distributed
> * HBase is realtime
> * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
> * Lucene offers many types of queries not currently available in HBase (e.g., 
> AND, OR, NOT, phrase, etc.)
> * It's easier to build scalable realtime systems on top of an already 
> architecturally sound, scalable realtime data system, e.g., HBase.
> * Scaling realtime search will be as simple as scaling HBase.
> Phase 1 - Indexing:
> * Integrate Lucene into HBase such that an index mirrors a given region.  
> This means cascading adds, updates, and deletes between a Lucene index and an 
> HBase region (and vice versa).
> * Define metadata to mark a region as indexed, and use a Solr schema to 
> allow the user to define the fields and analyzers.
> * Integrate with the HLog to ensure that index recovery can occur properly 
> (e.g., on region server failure)
> * Mirror region splits with indexes (use Lucene's IndexSplitter?)
> * When a region is written to HDFS, also write the corresponding Lucene index 
> to HDFS.
> * A row key will be the ID of a given Lucene document.  The Lucene docstore 
> will explicitly not be used because the document/row data is stored in HBase. 
>  We will need to determine the best data structure for efficiently mapping a 
> docid -> row key.  It could be a docstore, field cache, column stride 
> fields, or some other mechanism.
> * Write unit tests for the above
> Phase 2 - Queries:
> * Enable distributed Lucene queries
> * Regions that have Lucene indexes are inherently available and may be 
> searched on, meaning there's no need for a separate search-related system in 
> ZooKeeper.
> * Integrate search with HBase's RPC mechanism

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
