[
https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916890#action_12916890
]
Sertan Alkan commented on NUTCH-907:
------------------------------------
Hi Andrzej,
Thanks for the review and the feedback.
* Funny thing, I was actually going for {{datasetId}} for the name, but now
that you mention it, I prefer to use {{crawlId}} for this and rename the old
{{crawlId}} to {{batchId}}. I am not entirely sure how invasive that's
going to be, but I don't think it will be much of a hassle to change both at
once.
* I agree that command-line arguments should override the configuration by
actually setting the value in it, so that the setting is accessible elsewhere.
I'll modify the patch to work this way.
* A utility to handle the datasets is a good idea, though considering the
current Gora architecture, I think we may need to add a client interface there
somewhere. I've opened an
[issue|http://github.com/enis/gora/issues/issue/56] for this; we can start
thinking about the design there. We won't be able to write a generic utility in
Nutch, though, since this won't be available until we roll out a new version of
Gora. I'll pitch in the utility once we have that, but as that doesn't affect
this issue directly, I'd rather track it in a separate issue. Until
that issue is solved, I think it would be safe to leave manipulation of stores
(listing, removing, truncation, etc.) to the user.
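To make the second point concrete, here is a rough sketch of the override behaviour, not the actual patch. It uses {{java.util.Properties}} as a stand-in for the Hadoop configuration object so that it runs on its own. The key name {{storage.crawl.id}}, the {{-crawlId}} flag, the underscore separator, and the helper names are all assumptions made for illustration.

```java
import java.util.Properties;

public class CrawlIdOverride {
    // Hypothetical configuration key; the real name is whatever the patch settles on.
    static final String CRAWL_ID_KEY = "storage.crawl.id";

    /**
     * Looks for "-crawlId <value>" in the arguments and, if found, writes the
     * value into the configuration so that later readers of the configuration
     * see the overridden value rather than the one loaded from file.
     */
    static void applyCrawlIdArg(String[] args, Properties conf) {
        for (int i = 0; i < args.length - 1; i++) {
            if ("-crawlId".equals(args[i])) {
                conf.setProperty(CRAWL_ID_KEY, args[i + 1]);
            }
        }
    }

    /**
     * Derives a per-crawl table name by prefixing the base name with the
     * crawl id. The "_" separator is an assumption of this sketch.
     */
    static String tableName(Properties conf, String baseName) {
        String crawlId = conf.getProperty(CRAWL_ID_KEY, "");
        return crawlId.isEmpty() ? baseName : crawlId + "_" + baseName;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(CRAWL_ID_KEY, "default");   // value as if loaded from the config file
        applyCrawlIdArg(new String[] {"-crawlId", "news"}, conf);
        System.out.println(tableName(conf, "webpage"));
    }
}
```

Because the argument is written back into the configuration instead of being passed around separately, any code that only has access to the configuration resolves the same store.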
> DataStore API doesn't support multiple storage areas for multiple disjoint
> crawls
> ---------------------------------------------------------------------------------
>
> Key: NUTCH-907
> URL: https://issues.apache.org/jira/browse/NUTCH-907
> Project: Nutch
> Issue Type: Bug
> Reporter: Andrzej Bialecki
> Fix For: 2.0
>
> Attachments: NUTCH-907.patch
>
>
> In Nutch 1.x it was possible to easily select a set of crawl data (crawldb,
> page data, linkdb, etc) by specifying a path where the data was stored. This
> enabled users to run several disjoint crawls with different configs, but
> still using the same storage medium, just under different paths.
> This is not possible now because there is a 1:1 mapping between a specific
> DataStore instance and a set of crawl data.
> In order to support this functionality the Gora API should be extended so
> that it can create stores (and data tables in the underlying storage) that
> use arbitrary prefixes to identify the particular crawl dataset. Then the
> Nutch API should be extended to allow passing this "crawlId" value to select
> one of possibly many existing crawl datasets.