[jira] Updated: (HADOOP-1913) [HBase] Build a Lucene index on an HBase table

Ning Li (JIRA) Mon, 17 Sep 2007 19:32:04 -0700

     [ https://issues.apache.org/jira/browse/HADOOP-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Ning Li updated HADOOP-1913:
----------------------------

    Attachment: build_table_index.take2.patch

Thanks for the comments!

> Shouldn't IndexConf extend HBaseConfiguration else you'll not have the hbase 
> settings in the mix (Would IndexConfiguration be a better name than 
> IndexConf).

The content of an index configuration is actually stored as a property value in 
an hbase configuration. You can see an example in BuildTableIndex.java.
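
To illustrate the idea, here is a minimal sketch (not the code from the patch). 
It uses java.util.Properties as a stand-in for an hbase configuration; the 
property name "hbase.index.conf" and the XML layout are assumptions made only 
for this example. The point is that once the index configuration is a single 
string value, it survives whatever serialization the enclosing configuration 
goes through.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Properties;

final class IndexConfRoundTrip {
  // Hypothetical property name, chosen for this sketch only.
  static final String INDEX_CONF_KEY = "hbase.index.conf";

  // Serialize the whole configuration to text, standing in for the step
  // where a job configuration is shipped to remote tasks.
  static String ship(Properties conf) throws Exception {
    StringWriter out = new StringWriter();
    conf.store(out, null);
    return out.toString();
  }

  // Rebuild the configuration on the receiving side.
  static Properties receive(String shipped) throws Exception {
    Properties conf = new Properties();
    conf.load(new StringReader(shipped));
    return conf;
  }

  public static void main(String[] args) throws Exception {
    // Assumed XML layout; the real format is defined by the patch.
    String indexConfXml = "<index>\n  <column name=\"contents:\"/>\n</index>";

    Properties conf = new Properties();
    conf.setProperty(INDEX_CONF_KEY, indexConfXml);

    Properties remote = receive(ship(conf));
    System.out.println(indexConfXml.equals(remote.getProperty(INDEX_CONF_KEY)));
  }
}
```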

> You made the patch inside $HBASE_HOME/src rather than at $HADOOP_HOME.  You 
> should fix.  Otherwise it won't apply when hudson tries to apply it.

Done in take2.

> The way you add the per-column config. into a hadoop configuration is very 
> cute.  I'm unclear how multiple columns are done..... Should there be a 
> columns element to hold multiple column elements?   I'd suggest you add 
> javadoc with example config. ('cos trying to conjure the xml produced by 
> the code takes a little effort).

There is an example index configuration in BuildTableIndex.java. Configurations 
for a column are in a "column" element. I'll add the example to javadoc once we 
agree on the best way to do index configuration.
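
Until the javadoc example exists, the following is a rough sketch of how 
several "column" elements could be read back from one configuration string. 
Only the "column" element name comes from the patch discussion; the root 
element "index", the attribute names (name, index, store, tokenize), and the 
class names are hypothetical and may not match BuildTableIndex.java.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ColumnConfSketch {
  // Per-column settings: whether to index, store, and tokenize the value.
  static final class ColumnConf {
    final boolean index;
    final boolean store;
    final boolean tokenize;

    ColumnConf(boolean index, boolean store, boolean tokenize) {
      this.index = index;
      this.store = store;
      this.tokenize = tokenize;
    }
  }

  // Parse every "column" element into a map keyed by column name.
  // A missing attribute is read as false.
  static Map<String, ColumnConf> parse(String xml) throws Exception {
    Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
        .parse(new InputSource(new StringReader(xml)))
        .getDocumentElement();
    NodeList columns = root.getElementsByTagName("column");
    Map<String, ColumnConf> result = new LinkedHashMap<>();
    for (int i = 0; i < columns.getLength(); i++) {
      Element c = (Element) columns.item(i);
      result.put(c.getAttribute("name"), new ColumnConf(
          Boolean.parseBoolean(c.getAttribute("index")),
          Boolean.parseBoolean(c.getAttribute("store")),
          Boolean.parseBoolean(c.getAttribute("tokenize"))));
    }
    return result;
  }

  public static void main(String[] args) throws Exception {
    String xml = "<index>"
        + "<column name=\"contents:\" index=\"true\" store=\"false\" tokenize=\"true\"/>"
        + "<column name=\"anchor:\" index=\"true\" store=\"true\" tokenize=\"false\"/>"
        + "</index>";
    System.out.println(parse(xml).keySet());
  }
}
```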

> Ning, have you tried your patch on a distributed cluster?  Does your column 
> trick get properly distributed out and your LuceneDocumentWrapper work in the 
> distributed context?
> 
> Did you use lucene 2.2 or something else?
> I had a problem compiling:

Oops. The compile problem was my mistake (I forgot to remove some unused code). 
All fixed in take2. Yes, I included Lucene 2.2 in hbase/lib. And yes, I have 
tested on a distributed cluster. Since the content of an index configuration is 
a property value in an hbase configuration, it is distributed with that 
configuration and works properly in the distributed environment.

> [HBase] Build a Lucene index on an HBase table
> ----------------------------------------------
>
>                 Key: HADOOP-1913
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1913
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: contrib/hbase
>            Reporter: Ning Li
>            Priority: Minor
>         Attachments: build_table_index.patch, build_table_index.take2.patch
>
>
> This patch provides a Reducer class and other related classes which help to 
> build a Lucene index on an HBase table. The index build part is similar to 
> that of Nutch.
>   - Each row is modeled as a Lucene document: row key is indexed in its 
> untokenized form, column name-value pairs are Lucene field name-value pairs.
>   - IndexConf is used to configure various Lucene parameters, specify whether 
> to optimize an index and which columns to index and/or store, in tokenized or 
> untokenized form, etc.
>   - The number of reduce tasks decides the number of indexes (partitions). 
> The index(es) are stored in the output path of the job configuration.
>   - The index build process is done in the reduce phase. Users can use the 
> map phase to join rows from different tables or to pre-parse/analyze column 
> content, etc.
>   - A junit test is added to test the build of an index on an HBase table 
> with an identity mapper. It also serves as an example of how to use the new 
> classes.
>   - BuildTableIndex is provided to help build an index on an HBase table. 
> It should be moved to an examples package if HBase decides to have one.
> This patch requires the inclusion of the Lucene library.
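
The row-to-document model and the partitioning in the description above can be 
sketched in plain Java, without Lucene or Hadoop on the classpath. This is an 
illustration only: the Field class, the "ROWKEY" field name, and the choice to 
tokenize every column are assumptions (in the patch, tokenization is configured 
per column), and the partition function mirrors the hash-modulo scheme of 
Hadoop's default HashPartitioner rather than anything specific to this patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class RowDocumentSketch {
  // Stand-in for a Lucene field: a name, a value, and a tokenization flag.
  static final class Field {
    final String name;
    final String value;
    final boolean tokenized;

    Field(String name, String value, boolean tokenized) {
      this.name = name;
      this.value = value;
      this.tokenized = tokenized;
    }
  }

  // Hypothetical field name under which the row key is indexed.
  static final String ROWKEY_FIELD = "ROWKEY";

  // Model one row as a document: the row key becomes an untokenized field,
  // and each column name-value pair becomes a field name-value pair.
  static List<Field> toDocument(String rowKey, Map<String, String> columns) {
    List<Field> doc = new ArrayList<>();
    doc.add(new Field(ROWKEY_FIELD, rowKey, false));
    for (Map.Entry<String, String> e : columns.entrySet()) {
      // Assumption: tokenize all columns; the patch makes this configurable.
      doc.add(new Field(e.getKey(), e.getValue(), true));
    }
    return doc;
  }

  // Pick the reduce task (and therefore the index partition) for a row key.
  static int partition(String rowKey, int numReduceTasks) {
    return (rowKey.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }

  public static void main(String[] args) {
    Map<String, String> columns = new TreeMap<>();
    columns.put("contents:", "some page text");
    columns.put("anchor:", "link text");
    List<Field> doc = toDocument("row-1", columns);
    System.out.println(doc.size() + " fields, partition "
        + partition("row-1", 4) + " of 4");
  }
}
```

With N reduce tasks, every row key maps to exactly one of N partitions, which 
is why the job ends up with N separate indexes under the output path.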

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
