[jira] Updated: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

Sertan Alkan (JIRA) Mon, 27 Sep 2010 05:29:00 -0700

     [ 
https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sertan Alkan updated NUTCH-907:
-------------------------------

    Attachment: NUTCH-907.patch

Here's a patch that allows Nutch to create different schemas based on the same 
schema definition. Some points about the patch:

* To be able to prefix a schema name with a value, Nutch needs to know the 
default schema name defined in the gora mapping file (e.g. ...table=<name>...). 
Gora handles creation internally at the moment and doesn't expose this name 
to the outside. So, the patch introduces two new configuration options to pass 
the schema name to Nutch internals.
** Nutch *ignores* the schema name setting in the gora mapping file; instead, 
the configuration option {{storage.schema}} tells Nutch which schema name 
to use to access the data store. This value defaults to _webpage_.
** The {{storage.schema.id}} option defines the prefix to add to the schema 
name in {{storage.schema}}. By default this id is not provided, i.e. all jobs 
will run on the _webpage_ store as before.
* Apart from the configuration option, all jobs (injector, 
generator, fetcher, updatedb, indexer, benchmark and webtable reader) are 
modified to accept a schema id as an optional command line argument, 
{{-schemaId}}, which overrides the configuration option ({{-schemaId}} may 
seem an odd name, but I am not big on naming things).
* The patch also modifies the unit tests to use the same logic.
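
To make the lookup order above concrete, here is a minimal sketch of how a job 
might resolve the effective schema name. It is not taken from the patch: the 
class and method names are hypothetical, a plain Map stands in for the Hadoop 
configuration object, and the "_" separator between the id and the schema name 
is an assumption, since the comment above does not say how the prefix is joined.

```java
import java.util.Map;

// Hypothetical helper, not part of NUTCH-907.patch.
class SchemaNameSketch {
    static final String SCHEMA_KEY = "storage.schema";
    static final String SCHEMA_ID_KEY = "storage.schema.id";
    static final String DEFAULT_SCHEMA = "webpage";
    // Assumption: the patch may join the prefix and the name differently.
    static final String SEPARATOR = "_";

    // Resolves the schema name from configuration and command line arguments.
    static String resolveSchemaName(Map<String, String> conf, String[] args) {
        // storage.schema falls back to "webpage" when it is not set.
        String schema = conf.getOrDefault(SCHEMA_KEY, DEFAULT_SCHEMA);
        // storage.schema.id has no default; null means "no prefix".
        String schemaId = conf.get(SCHEMA_ID_KEY);
        // A -schemaId argument overrides the configured id.
        for (int i = 0; i < args.length - 1; i++) {
            if ("-schemaId".equals(args[i])) {
                schemaId = args[i + 1];
            }
        }
        if (schemaId == null || schemaId.isEmpty()) {
            return schema;
        }
        return schemaId + SEPARATOR + schema;
    }
}
```

With an empty configuration and no arguments this returns the plain default 
schema name, so existing single-crawl setups would behave as before.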

All unit tests pass without a problem, and I have run a simple crawl a) with 
the default configuration, b) with a schema id provided in the configuration, 
and c) with the ids given on the command line; the jobs seem to run well.

> DataStore API doesn't support multiple storage areas for multiple disjoint 
> crawls
> ---------------------------------------------------------------------------------
>
>                 Key: NUTCH-907
>                 URL: https://issues.apache.org/jira/browse/NUTCH-907
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Andrzej Bialecki 
>             Fix For: 2.0
>
>         Attachments: NUTCH-907.patch
>
>
> In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, 
> page data, linkdb, etc) by specifying a path where the data was stored. This 
> enabled users to run several disjoint crawls with different configs, but 
> still using the same storage medium, just under different paths.
> This is not possible now because there is a 1:1 mapping between a specific 
> DataStore instance and a set of crawl data.
> In order to support this functionality the Gora API should be extended so 
> that it can create stores (and data tables in the underlying storage) that 
> use arbitrary prefixes to identify the particular crawl dataset. Then the 
> Nutch API should be extended to allow passing this "crawlId" value to select 
> one of possibly many existing crawl datasets.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
