[Nutch-dev] [jira] Issue Comment Edited: (NUTCH-520) A common infrastructure for different index backends

JIRA Mon, 23 Jul 2007 08:01:15 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513856
 ]


Doğacan Güney edited comment on NUTCH-520 at 7/23/07 7:59 AM:
--------------------------------------------------------------

Here is my proposal on how we can do it along with a patch:

i) Add a NutchDocument class:

    A NutchDocument contains a mapping from String-s to List<String>-s as 
fields, metadata (to be explained later) and a score. NutchDocument fields 
do not carry any information about how they are meant to be indexed or stored 
(not entirely true, explained later). These options are missing because 
different backends may not support the same options. For example, solr 
doesn't (AFAIK) allow you to change how a field is stored at runtime. Also, one 
may want to index to a MySQL database (I don't know why, but it is possible), 
which again doesn't provide storage or indexing options.
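
As a rough illustration of the shape described above, here is a minimal sketch 
of such a document class. It is not the class from the attached patch: the 
name SketchDocument, its method names and the default score of 1.0 are all 
assumptions made for this example. The point is only that fields are plain 
multi-valued strings with no store/index flags attached.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for NutchDocument; names are illustrative only.
class SketchDocument {
    // Field name -> values. Note: no store/index options live here.
    private final Map<String, List<String>> fields = new HashMap<>();
    // Document-level metadata: free-form key -> values.
    private final Map<String, List<String>> metadata = new HashMap<>();
    // Default score is an assumption of this sketch.
    private float score = 1.0f;

    // Append a value to a (possibly multi-valued) field.
    void add(String name, String value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Read-only view of a field's values; empty if the field is absent.
    List<String> getFieldValues(String name) {
        return Collections.unmodifiableList(
            fields.getOrDefault(name, Collections.emptyList()));
    }

    // Append a value to a metadata key.
    void addMeta(String key, String value) {
        metadata.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Read-only view of a metadata key's values; empty if absent.
    List<String> getMeta(String key) {
        return Collections.unmodifiableList(
            metadata.getOrDefault(key, Collections.emptyList()));
    }

    float getScore() { return score; }

    void setScore(float score) { this.score = score; }
}

class SketchDocumentDemo {
    public static void main(String[] args) {
        SketchDocument doc = new SketchDocument();
        doc.add("title", "Hello");
        doc.add("anchor", "a");
        doc.add("anchor", "b");
        doc.setScore(2.5f);
        System.out.println(doc.getFieldValues("anchor"));
    }
}
```

Because the document carries no backend-specific flags, the same instance can 
be handed to any backend unchanged.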

ii) Add a NutchIndexWriter interface:

    NutchIndexWriter is the interface to be implemented if you want to add 
another indexing backend to nutch. A NutchIndexWriter writes, 
not-so-surprisingly, NutchDocument-s. Implementations are meant to take the 
NutchDocument, convert it into their internal format and then write the 
converted data. This patch adds two NutchIndexWriter-s: LuceneWriter and 
SolrWriter.
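
To make the "convert, then write" contract concrete, here is a small sketch of 
what a writer interface and a toy backend could look like. The interface name 
SketchIndexWriter, the close() method and the in-memory backend are assumptions 
for illustration; the real NutchIndexWriter in the patch may differ. A plain 
Map stands in for NutchDocument so the example is self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical writer contract; not the interface from the patch.
interface SketchIndexWriter {
    // Convert the document to the backend's format and write it.
    void write(Map<String, List<String>> doc);

    // Release backend resources (assumed lifecycle method).
    void close();
}

// Toy backend whose "internal format" is one text line per document.
class InMemoryLineWriter implements SketchIndexWriter {
    final List<String> lines = new ArrayList<>();
    private boolean closed = false;

    @Override
    public void write(Map<String, List<String>> doc) {
        if (closed) {
            throw new IllegalStateException("writer is closed");
        }
        // Conversion step: sort fields by name for a stable layout.
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, List<String>> e : new TreeMap<>(doc).entrySet()) {
            if (sb.length() > 0) {
                sb.append(';');
            }
            sb.append(e.getKey()).append('=').append(String.join(",", e.getValue()));
        }
        // Write step: here, just keep the converted line in memory.
        lines.add(sb.toString());
    }

    @Override
    public void close() {
        closed = true;
    }
}

class InMemoryLineWriterDemo {
    public static void main(String[] args) {
        InMemoryLineWriter writer = new InMemoryLineWriter();
        Map<String, List<String>> doc = new TreeMap<>();
        doc.put("title", List.of("Hello"));
        doc.put("anchor", List.of("a", "b"));
        writer.write(doc);
        writer.close();
        System.out.println(writer.lines);
    }
}
```

A Lucene or Solr backend would follow the same two steps, with the conversion 
producing that backend's own document type instead of a text line.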

Also, Indexer.OutputFormat is updated to use NutchIndexWriter instead of 
lucene's index writer. After this patch, it is possible to index to more than 
one backend simultaneously. Indexer is now used like this:

bin/nutch index -lucene crawl/indexes -solr "http://...." crawl/crawldb 
crawl/linkdb crawl/segments...

You can use either lucene or solr backend or both.
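
Indexing to several backends at once can be pictured as a simple fan-out: one 
write call is forwarded to every enabled writer. The sketch below shows that 
idea only. BackendWriter, CountingBackend and FanOutWriter are hypothetical 
names, and this is not the code of Indexer.OutputFormat.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical minimal writer contract for this sketch.
interface BackendWriter {
    void write(Map<String, List<String>> doc);
}

// Stand-in backend that only counts what it receives.
class CountingBackend implements BackendWriter {
    final String name;
    int written = 0;

    CountingBackend(String name) {
        this.name = name;
    }

    @Override
    public void write(Map<String, List<String>> doc) {
        written++;
    }
}

// Forwards every document to all enabled backends, in order.
class FanOutWriter implements BackendWriter {
    private final List<BackendWriter> backends;

    FanOutWriter(List<BackendWriter> backends) {
        // Copy so later changes to the caller's list have no effect.
        this.backends = new ArrayList<>(backends);
    }

    @Override
    public void write(Map<String, List<String>> doc) {
        for (BackendWriter backend : backends) {
            backend.write(doc);
        }
    }
}

class FanOutWriterDemo {
    public static void main(String[] args) {
        CountingBackend lucene = new CountingBackend("lucene");
        CountingBackend solr = new CountingBackend("solr");
        FanOutWriter out = new FanOutWriter(List.of(lucene, solr));
        out.write(Map.of("title", List.of("Hello")));
        System.out.println(lucene.written + " " + solr.written);
    }
}
```

With only one backend enabled, the list has a single element and the loop 
degenerates to a direct call.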

iii) Allow indexing filters to define index-level and document-level metadata:

     NutchDocument fields are simple key/value pairs and LuceneWriter can't 
determine how to store/index them by just looking at the fields. There are two 
ways to pass data to index backends:

    1) Through configuration: Options specified in configuration are meant to 
be valid for all documents. A new method "addIndexBackendOptions" is added to 
IndexingFilter. This is used by indexing filters to add 'hints' to index 
backends.

        For example, index-basic plugin calls:

 LuceneWriter.addFieldOptions("title", LuceneWriter.STORE.YES,  
LuceneWriter.INDEX.TOKENIZED, conf);

     This tells the lucene backend to store and tokenize title.
     
     2) Document-level: Per-document free-form string,string[] pairs. For 
example, if you normally want to store field "foo" in a lucene index, but you 
do not want to do it for a specific document, you can add a 
<"lucene.field.foo", "lucene.store.no"> pair to that document's metadata and 
LuceneWriter will not store the value of field "foo" for that particular document.
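
The interplay of the two levels can be sketched as a lookup where a 
document-level hint wins over the configuration-level default. The class 
below is an illustration under assumptions, not LuceneWriter's actual logic: 
StoreOptionResolver, its method names, the "lucene.store.yes" document-level 
value and the fallback of not storing unconfigured fields are all hypothetical. 
Only the "lucene.field.foo" / "lucene.store.no" pair is taken from the text 
above, and a List is used in place of string[] for brevity.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical resolver: document metadata overrides per-field defaults.
class StoreOptionResolver {
    enum Store { YES, NO }

    // Configuration-level options, valid for all documents.
    private final Map<String, Store> defaults = new HashMap<>();

    // Register the default store option for a field.
    void addFieldOption(String field, Store store) {
        defaults.put(field, store);
    }

    // Decide whether to store a field for one particular document.
    Store resolve(String field, Map<String, List<String>> docMeta) {
        List<String> hints =
            docMeta.getOrDefault("lucene.field." + field, Collections.emptyList());
        if (hints.contains("lucene.store.no")) {
            return Store.NO;
        }
        if (hints.contains("lucene.store.yes")) {
            return Store.YES;
        }
        // No document-level hint: fall back to configuration.
        // Treating unconfigured fields as NO is an assumption of this sketch.
        return defaults.getOrDefault(field, Store.NO);
    }
}

class StoreOptionResolverDemo {
    public static void main(String[] args) {
        StoreOptionResolver resolver = new StoreOptionResolver();
        resolver.addFieldOption("foo", StoreOptionResolver.Store.YES);

        Map<String, List<String>> plainDoc = new HashMap<>();
        Map<String, List<String>> overriddenDoc = new HashMap<>();
        overriddenDoc.put("lucene.field.foo", List.of("lucene.store.no"));

        System.out.println(resolver.resolve("foo", plainDoc));
        System.out.println(resolver.resolve("foo", overriddenDoc));
    }
}
```

Keeping the override in metadata rather than in the field itself means 
backends that do not understand the "lucene." prefix can simply ignore it.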

Extra notes:

* This patch is a very early draft. I am sure that a lot of stuff doesn't work. 
However, I tested indexing a 30000-URL segment to both solr and lucene and 
didn't run into any problems. When only indexing to lucene, there is no 
noticeable performance difference from earlier nutch versions.

* NutchDocument has an add(Field) method for easy upgrade of older indexing 
filters. However, it is slower by comparison and should only be used for upgrading.

* I believe that this is a very important feature for nutch. (I don't know why 
I am writing this as a note)

Comments, suggestions, reviews and other feedback are welcome.

Edit: Updated to reflect the latest patch.


 was:
Here is my proposal on how we can do it along with a patch:

i) Add a NutchDocument class:

    A NutchDocument contains a mapping from String-s to List<String>-s as 
fields, metadata (to be explained later) and a score. NutchDocument fields 
do not carry any information about how they are meant to be indexed or stored 
(not entirely true, explained later). These options are missing because 
different backends may not support the same options. For example, solr 
doesn't (AFAIK) allow you to change how a field is stored at runtime. Also, one 
may want to index to a MySQL database (I don't know why, but it is possible), 
which again doesn't provide storage or indexing options.

ii) Add a NutchIndexWriter interface:

    NutchIndexWriter is the interface to be implemented if you want to add 
another indexing backend to nutch. A NutchIndexWriter writes, 
not-so-surprisingly, NutchDocument-s. Implementations are meant to take the 
NutchDocument, convert it into their internal format and then write the 
converted data. Also, Indexer.OutputFormat creates a list of enabled classes 
that implement NutchIndexWriter. Upon each call to 
OutputFormat.RecordWriter.write(Text url, NutchDocument document), this method 
delegates it to NutchIndexWriter.write(NutchDocument doc).

    This patch adds two classes that implement NutchIndexWriter: LuceneWriter 
and SolrWriter.

iii) Allow indexing filters to define index-level and document-level metadata:

     NutchDocument fields are simple key/value pairs and LuceneWriter can't 
determine how to store/index them by just looking at the fields. Because of 
this, two types of metadata are defined:

    1) Index-level: Index level metadata is meant to be valid for all documents 
and doesn't change for every document. Metadata-s are free form string,string[] 
pairs that are meant to be picked up by underlying index backends.

        A new method "addIndexMeta" is added to ScoringFilter. This is used by 
indexing filters to add 'hints' to index backends.

        For example, index-basic plugin adds <"lucene.field.title", 
"lucene.store.yes"> and <"lucene.field.title", "lucene.index.tokenized"> pairs, 
hinting to lucene backend that field "title" is meant to be stored and 
tokenized. Implementations should always prefix meta keys with the name of the 
backend to avoid conflict. If lucene backend is not active, this information is 
simply ignored. 
     
     2) Document-level: Per-document free form string,string[] pairs. For 
example, if you normally want to store field "foo" in a lucene index, but you 
do not want to do it for a specific document, you can add a 
<"lucene.field.foo", "lucene.store.no"> pair to that document's metadata and 
LuceneWriter will not store field value of "foo" for that document.

Extra notes:

* This patch is a very early draft. I am sure that a lot of stuff doesn't work. 
However, I tested indexing a 30000-URL segment to both solr and lucene and 
didn't run into any problems. When only indexing to lucene, there is no 
noticeable performance difference from earlier nutch versions.

* NutchDocument has an add(Field) method for easy upgrade of older indexing 
filters. However, it is slower by comparison and should only be used for upgrading.

* I believe that this is a very important feature for nutch. (I don't know why 
I am writing this as a note)

Comments, suggestions, reviews and other feedback are welcome.

> A common infrastructure for different index backends
> ----------------------------------------------------
>
>                 Key: NUTCH-520
>                 URL: https://issues.apache.org/jira/browse/NUTCH-520
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Doğacan Güney
>         Attachments: RFC_multiple_index_backends.patch, 
> RFC_multiple_index_backends_v2.patch, RFC_multiple_index_backends_v3.patch
>
>
> With the discussion of solr as a possible index and search backend, I think 
> we need a new indexing architecture (that doesn't depend on lucene) that can 
> use multiple backends to index.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

