[ https://issues.apache.org/jira/browse/HBASE-3529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
liusheding updated HBASE-3529:
------------------------------
    Description:
Using the Apache Lucene library we can add freetext search to HBase. The advantages of this are:

* HBase is highly scalable and distributed
* HBase is realtime
* Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
* Lucene offers many types of queries not currently available in HBase (eg, AND, OR, NOT, phrase, etc)
* It's easier to build scalable realtime systems on top of already architecturally sound, scalable realtime data system, eg, HBase.
* Scaling realtime search will be as simple as scaling HBase.

Phase 1 - Indexing:

* Integrate Lucene into HBase such that an index mirrors a given region. This means cascading add, update, and deletes between a Lucene index and an HBase region (and vice versa).
* Define meta-data to mark a region as indexed, and use a Solr schema to allow the user to define the fields and analyzers.
* Integrate with the HLog to ensure that index recovery can occur properly (eg, on region server failure)
* Mirror region splits with indexes (use Lucene's IndexSplitter?)
* When a region is written to HDFS, also write the corresponding Lucene index to HDFS.
* A row key will be the ID of a given Lucene document. The Lucene docstore will explicitly not be used because the document/row data is stored in HBase. We will need to solve what the best data structure for efficiently mapping a docid -> row key is. It could be a docstore, field cache, column stride fields, or some other mechanism.
* Write unit tests for the above

Phase 2 - Queries:

* Enable distributed Lucene queries
* Regions that have Lucene indexes are inherently available and may be searched on, meaning there's no need for a separate search related system in Zookeeper.
* Integrate search with HBase's RPC mechanis

  was:
Using the Apache Lucene library we can add freetext search to HBase. The advantages of this are:

* HBase is highly scalable and distributed
* HBase is realtime
* Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
* Lucene offers many types of queries not currently available in HBase (eg, AND, OR, NOT, phrase, etc)
* It's easier to build scalable realtime systems on top of already architecturally sound, scalable realtime data system, eg, HBase.
* Scaling realtime search will be as simple as scaling HBase.

Phase 1 - Indexing:

* Integrate Lucene into HBase such that an index mirrors a given region. This means cascading add, update, and deletes between a Lucene index and an HBase region (and vice versa).
* Define meta-data to mark a region as indexed, and use a Solr schema to allow the user to define the fields and analyzers.
* Integrate with the HLog to ensure that index recovery can occur properly (eg, on region server failure)
* Mirror region splits with indexes (use Lucene's IndexSplitter?)
* When a region is written to HDFS, also write the corresponding Lucene index to HDFS.
* A row key will be the ID of a given Lucene document. The Lucene docstore will explicitly not be used because the document/row data is stored in HBase. We will need to solve what the best data structure for efficiently mapping a docid -> row key is. It could be a docstore, field cache, column stride fields, or some other mechanism.
* Write unit tests for the above

Phase 2 - Queries:

* Enable distributed Lucene queries
* Regions that have Lucene indexes are inherently available and may be searched on, meaning there's no need for a separate search related system in Zookeeper.
* Integrate search with HBase's RPC mechanism


> Add search to HBase
> -------------------
>
> Key: HBASE-3529
> URL: https://issues.apache.org/jira/browse/HBASE-3529
> Project: HBase
> Issue Type: Improvement
> Affects Versions: 0.90.0
> Reporter: Jason Rutherglen
> Attachments: HBASE-3529.patch, HDFS-APPEND-0.20-LOCAL-FILE.patch
>
>
> Using the Apache Lucene library we can add freetext search to HBase. The
> advantages of this are:
> * HBase is highly scalable and distributed
> * HBase is realtime
> * Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
> * Lucene offers many types of queries not currently available in HBase (eg,
> AND, OR, NOT, phrase, etc)
> * It's easier to build scalable realtime systems on top of already
> architecturally sound, scalable realtime data system, eg, HBase.
> * Scaling realtime search will be as simple as scaling HBase.
> Phase 1 - Indexing:
> * Integrate Lucene into HBase such that an index mirrors a given region.
> This means cascading add, update, and deletes between a Lucene index and an
> HBase region (and vice versa).
> * Define meta-data to mark a region as indexed, and use a Solr schema to
> allow the user to define the fields and analyzers.
> * Integrate with the HLog to ensure that index recovery can occur properly
> (eg, on region server failure)
> * Mirror region splits with indexes (use Lucene's IndexSplitter?)
> * When a region is written to HDFS, also write the corresponding Lucene index
> to HDFS.
> * A row key will be the ID of a given Lucene document. The Lucene docstore
> will explicitly not be used because the document/row data is stored in HBase.
> We will need to solve what the best data structure for efficiently mapping a
> docid -> row key is. It could be a docstore, field cache, column stride
> fields, or some other mechanism.
> * Write unit tests for the above
> Phase 2 - Queries:
> * Enable distributed Lucene queries
> * Regions that have Lucene indexes are inherently available and may be
> searched on, meaning there's no need for a separate search related system in
> Zookeeper.
> * Integrate search with HBase's RPC mechanis

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
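
The Phase 1 items on cascading adds, updates, and deletes and on mapping a docid -> row key can be illustrated with a minimal sketch. It is plain Java with no HBase or Lucene dependency. The class RegionIndexSketch and its methods put, delete, and search are hypothetical names invented for this illustration, and the whitespace tokenizer stands in for a real analyzer. The docid -> row key map is a simple in-memory list here, which is only one of the candidate structures the description mentions, not a chosen design.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration only: none of these names come from HBase or Lucene.
class RegionIndexSketch {
    // term -> sorted docids containing that term (a toy inverted index)
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    // docid -> row key; a null slot marks a deleted document
    private final List<String> docIdToRowKey = new ArrayList<>();
    // row key -> current live docid, used to find the old doc on update/delete
    private final Map<String, Integer> rowKeyToDocId = new HashMap<>();

    // Mirror a region put: an update is modeled as delete-then-add.
    void put(String rowKey, String text) {
        delete(rowKey);
        int docId = docIdToRowKey.size();
        docIdToRowKey.add(rowKey);
        rowKeyToDocId.put(rowKey, docId);
        for (String term : text.toLowerCase().split("\\s+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Mirror a region delete: tombstone the docid; postings are left in place.
    void delete(String rowKey) {
        Integer docId = rowKeyToDocId.remove(rowKey);
        if (docId != null) {
            docIdToRowKey.set(docId, null);
        }
    }

    // Return the row keys of live documents containing every given term (AND).
    List<String> search(String... terms) {
        TreeSet<Integer> hits = null;
        for (String term : terms) {
            TreeSet<Integer> docs =
                postings.getOrDefault(term.toLowerCase(), new TreeSet<>());
            if (hits == null) {
                hits = new TreeSet<>(docs);
            } else {
                hits.retainAll(docs);
            }
        }
        List<String> rowKeys = new ArrayList<>();
        if (hits != null) {
            for (int docId : hits) {
                String rowKey = docIdToRowKey.get(docId);
                if (rowKey != null) {
                    rowKeys.add(rowKey);
                }
            }
        }
        return rowKeys;
    }
}
```

In this sketch a search resolves term hits to docids first and only then to row keys, so the row data itself would still be fetched from the region, in line with the note that the Lucene docstore is not used. Tombstoning deleted docids rather than rewriting postings is a simplification chosen for brevity.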