[ https://issues.apache.org/jira/browse/HADOOP-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ning Li updated HADOOP-1913:
----------------------------

    Attachment: build_table_index.take2.patch

Thanks for the comments!

> Shouldn't IndexConf extend HBaseConfiguration else you'll not have the hbase
> settings in the mix (Would IndexConfiguration be a better name than
> IndexConf).

The content of an index configuration is actually a property value in an hbase configuration. You can see an example in BuildTableIndex.java.

> You made the patch inside $HBASE_HOME/src rather than at $HADOOP_HOME. You
> should fix. Otherwise it won't apply when hudson tries to apply it.

Done in take2.

> The way you add the per-column config. into a hadoop configuration is very
> cute. I'm unclear how multiple columns are done..... Should there be a
> columns element to hold multiple column elements? I'd suggest you add
> javadoc with example config. ('cos trying to read conjure the xml produced by
> the code takes a little effort).

There is an example index configuration in BuildTableIndex.java. The configuration for each column goes in its own "column" element. I'll add the example to the javadoc once we agree on the best way to do index configuration.

> Ning, have you tried your patch on a distributed cluster? Does your column
> trick get properly distributed out and your LuceneDocumentWrapper work in the
> distributed context?
>
> Did you use lucene 2.2 or something else?
> I had a problem compiling:

Oops. The compile problem was my mistake (I forgot to remove some unused code). It is fixed in take2.

Yes, I included Lucene 2.2 in hbase/lib. And yes, I have tested on a distributed cluster. Since the content of an index configuration is a property in an hbase configuration, it is distributed along with that configuration and works properly in the distributed environment.

> [HBase] Build a Lucene index on an HBase table
> ----------------------------------------------
>
>                 Key: HADOOP-1913
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1913
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: contrib/hbase
>            Reporter: Ning Li
>            Priority: Minor
>         Attachments: build_table_index.patch, build_table_index.take2.patch
>
>
> This patch provides a Reducer class and other related classes which help to
> build a Lucene index on an HBase table. The index build part is similar to
> that of Nutch.
>   - Each row is modeled as a Lucene document: row key is indexed in its
> untokenized form, column name-value pairs are Lucene field name-value pairs.
>   - IndexConf is used to configure various Lucene parameters, specify whether
> to optimize an index and which columns to index and/or store, in tokenized or
> untokenized form, etc.
>   - The number of reduce tasks decides the number of indexes (partitions).
> The indexes are stored in the output path of the job configuration.
>   - The index build process is done in the reduce phase. Users can use the
> map phase to join rows from different tables or to pre-parse/analyze column
> content, etc.
>   - A junit test is added to test the build of an index on an HBase table
> with an identity mapper. It also serves as an example of how to use the new
> classes.
>   - BuildTableIndex is provided to help build an index on an HBase table.
> It should be moved to an examples package if HBase decides to have one.
> This patch requires the inclusion of the Lucene library.
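
For readers without the patch at hand, the row-to-document mapping and the "one index per reduce task" rule from the issue description can be sketched in plain Java. This is only an illustration under stated assumptions, not the code in the patch: it uses no Lucene or HBase classes, the class name RowToDocumentSketch, the field name "ROWKEY", and the column names are hypothetical, and the hash-based partition rule is an assumption about how rows might be spread over reduce tasks. It also does not model tokenized versus untokenized fields.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class RowToDocumentSketch {

    // Hypothetical field name for the row key; the patch may use a different one.
    public static final String ROW_KEY_FIELD = "ROWKEY";

    /**
     * Models one table row as a flat "document" (field name -> field value).
     * The row key becomes one field; each selected column becomes another.
     */
    public static Map<String, String> toDocument(String rowKey,
                                                 Map<String, String> columns,
                                                 Set<String> columnsToIndex) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put(ROW_KEY_FIELD, rowKey);
        for (Map.Entry<String, String> column : columns.entrySet()) {
            // Only columns named in the (hypothetical) index configuration are kept.
            if (columnsToIndex.contains(column.getKey())) {
                doc.put(column.getKey(), column.getValue());
            }
        }
        return doc;
    }

    /**
     * Assumed hash-style partitioning: each reduce task builds one index,
     * so the partition number selects the index a row ends up in.
     */
    public static int partitionFor(String rowKey, int numReduceTasks) {
        return (rowKey.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("contents:", "some page text");
        columns.put("anchor:", "link text");

        Map<String, String> doc =
            toDocument("row1", columns, Collections.singleton("contents:"));
        System.out.println("document fields: " + doc);
        System.out.println("index partition: " + partitionFor("row1", 4));
    }
}
```

In the actual patch the document would be a Lucene document built in the reduce phase, and the set of columns to index would come from the index configuration rather than a hard-coded set.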