[jira] Commented: (HADOOP-1913) [HBase] Build a Lucene index on an HBase table

Ning Li (JIRA) Tue, 18 Sep 2007 15:40:04 -0700

    [ 
https://issues.apache.org/jira/browse/HADOOP-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12528597
 ]


Ning Li commented on HADOOP-1913:
---------------------------------

> Does not conjuring a substantial amount of XML in a StringBuilder in java 
> code become untenable soon after you have two or three columns Ning?  I think 
> folks are going to want to do their XML up in a file that they can pass the 
> MR job (and against which they can run xmllint, etc., to verify 
> well-formedness).

Yes, that'll be nice.
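
As a rough illustration of the well-formedness check that keeping the XML in a
file makes possible, here is a minimal sketch using only the JDK's built-in
JAXP parser. The class name IndexConfXmlCheck and the element names in the
sample are assumptions made up for this example; they are not taken from the
patch.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of the patch: checks that an index
// configuration is well-formed XML before it is handed to a job.
final class IndexConfXmlCheck {

    /** Returns true if the stream parses as well-formed XML, false otherwise. */
    static boolean isWellFormed(InputStream in) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(in);
            return true;
        } catch (SAXException e) {
            // The parser rejected the document: it is not well-formed.
            return false;
        } catch (Exception e) {
            // I/O or parser setup problems are not a verdict on the XML.
            throw new IllegalStateException(e);
        }
    }

    /** Convenience overload for configuration held in a string. */
    static boolean isWellFormed(String xml) {
        return isWellFormed(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) {
        // Element names below are invented for illustration only.
        String good = "<configuration><column><name>contents:</name></column></configuration>";
        String bad = "<configuration><column></configuration>";
        System.out.println(isWellFormed(good));
        System.out.println(isWellFormed(bad));
    }
}
```

This only checks well-formedness, which is what xmllint does by default; it
does not validate the content against any schema.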

> Why not have your IndexConf add a resource named 'hbase-index.xml' as 
> HBaseConfiguration adds hbase-site.xml and hbase-default.xml.  Your example 
> pasted above would be the content of one such hbase-index.xml file.
> 
> Otherwise, can we add the config. to hbase-site.xml?  One property would list 
> columns and then per column, you'd add indexing properties with column name 
> as qualifier as in:

My thinking was that the indexing configuration is specified per job, not per 
HBase instance. Applications will want to specify different indexing 
configurations for different tables, which may have the same column names. 
Different applications may even want to index the same table differently.
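
To make the per-job point concrete, here is a small sketch in plain Java in
which two jobs index the same table, and the same column, with different
options. Everything in it is hypothetical: the class PerJobIndexSettings, its
methods, and the table and column names are invented for illustration and do
not reflect the IndexConf API in the patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: indexing settings that belong to one job,
// so the same table and column can be indexed differently by another job.
final class PerJobIndexSettings {

    /** Assumed per-column options: index it, store it, tokenize it. */
    static final class ColumnOptions {
        final boolean index;
        final boolean store;
        final boolean tokenize;

        ColumnOptions(boolean index, boolean store, boolean tokenize) {
            this.index = index;
            this.store = store;
            this.tokenize = tokenize;
        }
    }

    private final String table;
    private final Map<String, ColumnOptions> columns = new LinkedHashMap<>();

    PerJobIndexSettings(String table) {
        this.table = table;
    }

    /** Registers options for one column; returns this to allow chaining. */
    PerJobIndexSettings column(String name, boolean index, boolean store,
                               boolean tokenize) {
        columns.put(name, new ColumnOptions(index, store, tokenize));
        return this;
    }

    /** Returns the options for a column, or null if the job does not index it. */
    ColumnOptions optionsFor(String column) {
        return columns.get(column);
    }

    String table() {
        return table;
    }

    public static void main(String[] args) {
        // Two jobs, same (made-up) table and column, different settings.
        PerJobIndexSettings searchJob =
            new PerJobIndexSettings("webtable").column("contents:", true, false, true);
        PerJobIndexSettings archiveJob =
            new PerJobIndexSettings("webtable").column("contents:", true, true, false);

        System.out.println(searchJob.optionsFor("contents:").store);
        System.out.println(archiveJob.optionsFor("contents:").store);
    }
}
```

Because the settings travel with the job object rather than living in a
site-wide file, neither job's choices leak into the other's.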

An alternative would be to include the configuration in the application's jar 
file, since that's what gets distributed. But that's a bit awkward, since a new 
jar file has to be generated for each new run. Yet another alternative is to 
store the configuration file in HDFS or HBase...

> [HBase] Build a Lucene index on an HBase table
> ----------------------------------------------
>
>                 Key: HADOOP-1913
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1913
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: contrib/hbase
>            Reporter: Ning Li
>            Priority: Minor
>         Attachments: build_table_index.patch, 
> build_table_index.take2.again.patch, build_table_index.take2.patch
>
>
> This patch provides a Reducer class and other related classes which help to 
> build a Lucene index on an HBase table. The index build part is similar to 
> that of Nutch.
>   - Each row is modeled as a Lucene document: row key is indexed in its 
> untokenized form, column name-value pairs are Lucene field name-value pairs.
>   - IndexConf is used to configure various Lucene parameters, specify whether 
> to optimize an index and which columns to index and/or store, in tokenized or 
> untokenized form, etc.
>   - The number of reduce tasks decides the number of indexes (partitions). 
> The index(es) is stored in the output path of job configuration.
>   - The index build process is done in the reduce phase. Users can use the 
> map phase to join rows from different tables or to pre-parse/analyze column 
> content, etc.
>   - A junit test is added to test the build of an index on an HBase table 
> with an identity mapper. It also serves as an example on how to use the new 
> classes.
>   - BuildTableIndex is provided to help build an index on an HBase table. 
> It should be moved to an examples package if HBase decides to have one.
> This patch requires the inclusion of the Lucene library.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
