[
https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sertan Alkan updated NUTCH-907:
-------------------------------
Attachment: NUTCH-907.v2.patch
Here's the modified version of the patch after Andrzej's review. The changes
relative to the original patch are as follows:
* The old {{crawlId}} option is renamed to {{batchId}} for convenience.
* All jobs now accept an optional argument, {{-crawlId <id>}}, which prefixes the
schema name. Jobs keep this value in the configuration so that it can be used
later by, say, plugins.
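As a rough illustration of the second point, the sketch below shows one way a job could turn {{-crawlId <id>}} into a schema prefix and keep the id in its configuration. It uses {{java.util.Properties}} as a stand-in for the Hadoop configuration object, and the property key, the underscore separator, and the method names are assumptions made for illustration, not taken from the patch.

```java
import java.util.Properties;

public class CrawlIdSketch {
    // Hypothetical property key; the key used by the patch may differ.
    static final String CRAWL_ID_KEY = "storage.crawl.id";

    // Scan the arguments for "-crawlId <id>" and record the id in the configuration.
    static void parseCrawlId(String[] args, Properties conf) {
        for (int i = 0; i < args.length - 1; i++) {
            if ("-crawlId".equals(args[i])) {
                conf.setProperty(CRAWL_ID_KEY, args[i + 1]);
                return;
            }
        }
    }

    // Prefix the base schema name with the crawl id, if one is set.
    // The "_" separator is an assumption for this sketch.
    static String schemaName(String baseName, Properties conf) {
        String crawlId = conf.getProperty(CRAWL_ID_KEY, "");
        return crawlId.isEmpty() ? baseName : crawlId + "_" + baseName;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        parseCrawlId(new String[] {"-crawlId", "news"}, conf);
        System.out.println(schemaName("webpage", conf));
    }
}
```

Because the id stays in the configuration, any later component that receives the same configuration (a plugin, for instance) can read it back with the same key instead of re-parsing the command line.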
All unit tests pass, and I have again run a simple crawl without any problems. I
have also tested the {{batchId}} option by generating two different sets of the
injected URLs and running a fetch-parse cycle on those sets. The jobs seem to
recognize the correct {{batchId}} and select only the corresponding URLs.
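The selection behaviour checked in that test can be pictured with a small sketch: each generated URL carries the id of the batch it was generated in, and a later job only picks up URLs whose mark matches the requested id. The in-memory map and the method name below are hypothetical stand-ins for the real store and job code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BatchIdSketch {
    // Return the URLs whose batch mark equals the requested batch id.
    static List<String> selectBatch(Map<String, String> marks, String batchId) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, String> e : marks.entrySet()) {
            if (batchId.equals(e.getValue())) {
                selected.add(e.getKey());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Two generated sets over the same injected URLs (illustrative data only).
        Map<String, String> marks = new LinkedHashMap<>();
        marks.put("http://a.example/", "batch-1");
        marks.put("http://b.example/", "batch-2");
        marks.put("http://c.example/", "batch-1");
        System.out.println(selectBatch(marks, "batch-1"));
    }
}
```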
As I said before, I prefer to leave the store manipulation utility out of this
patch and handle it in a separate issue once we have that functionality in
Gora. What do you think?
> DataStore API doesn't support multiple storage areas for multiple disjoint
> crawls
> ---------------------------------------------------------------------------------
>
> Key: NUTCH-907
> URL: https://issues.apache.org/jira/browse/NUTCH-907
> Project: Nutch
> Issue Type: Bug
> Reporter: Andrzej Bialecki
> Fix For: 2.0
>
> Attachments: NUTCH-907.patch, NUTCH-907.v2.patch
>
>
> In Nutch 1.x it was possible to easily select a set of crawl data (crawldb,
> page data, linkdb, etc) by specifying a path where the data was stored. This
> enabled users to run several disjoint crawls with different configs, but
> still using the same storage medium, just under different paths.
> This is not possible now because there is a 1:1 mapping between a specific
> DataStore instance and a set of crawl data.
> In order to support this functionality the Gora API should be extended so
> that it can create stores (and data tables in the underlying storage) that
> use arbitrary prefixes to identify the particular crawl dataset. Then the
> Nutch API should be extended to allow passing this "crawlId" value to select
> one of possibly many existing crawl datasets.