[jira] Commented: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

Andrzej Bialecki (JIRA) Fri, 01 Oct 2010 05:22:01 -0700

    [ https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916870#action_12916870 ]


Andrzej Bialecki commented on NUTCH-907:
----------------------------------------

Hi Sertan,

Thanks for the patch, this looks very good! A few comments:

* I'm not good at naming things either... schemaId is a little bit cryptic 
though. If we didn't already use crawlId I would vote for that (and then rename 
crawlId to batchId or fetchId). As it is now... I don't know, maybe datasetId?

* since we now create multiple datasets, we need some way to manage them, i.e. 
at least list and delete them (create is implicit). There is no such 
functionality in this patch, but this can also be addressed as a separate issue.

* IndexerMapReduce.createIndexJob: I think it would be useful to pass the 
"datasetId" as a Job property - this way indexing filter plugins can use this 
property to populate NutchDocument fields if needed. FWIW, it may be a good 
idea to do this in other jobs as well...
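
To illustrate the last point, here is a minimal, self-contained sketch of the 
idea. It is not code from the patch: java.util.Properties stands in for the 
Hadoop job configuration, a plain Map stands in for NutchDocument, and the 
property key "storage.dataset.id", the class name and both method names are 
made up for the example. With Hadoop the equivalent calls would be 
Configuration.set(key, value) at job setup and Configuration.get(key) in the 
plugin.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

/**
 * Sketch only. Properties is a stand-in for the Hadoop job configuration and
 * Map<String, String> is a stand-in for NutchDocument; neither is the real API.
 */
public class DatasetIdSketch {

    // Hypothetical property key; the actual patch may choose a different name.
    static final String DATASET_ID_KEY = "storage.dataset.id";

    // Job-setup side: record the dataset id as a job property.
    static void configureJob(Properties jobProps, String datasetId) {
        jobProps.setProperty(DATASET_ID_KEY, datasetId);
    }

    // Plugin side: copy the dataset id into a document field when it is set.
    static void filter(Map<String, String> doc, Properties jobProps) {
        String datasetId = jobProps.getProperty(DATASET_ID_KEY);
        if (datasetId != null) {
            doc.put("datasetId", datasetId);
        }
    }

    public static void main(String[] args) {
        Properties jobProps = new Properties();
        configureJob(jobProps, "crawl_a");

        Map<String, String> doc = new HashMap<>();
        doc.put("url", "http://example.com/");
        filter(doc, jobProps);

        System.out.println(doc.get("datasetId")); // prints "crawl_a"
    }
}
```

The point of going through a job property is that plugins need no new 
arguments: anything that already receives the job configuration can read the 
id, and a job that never sets it leaves the documents unchanged.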

> DataStore API doesn't support multiple storage areas for multiple disjoint 
> crawls
> ---------------------------------------------------------------------------------
>
>                 Key: NUTCH-907
>                 URL: https://issues.apache.org/jira/browse/NUTCH-907
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Andrzej Bialecki 
>             Fix For: 2.0
>
>         Attachments: NUTCH-907.patch
>
>
> In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, 
> page data, linkdb, etc) by specifying a path where the data was stored. This 
> enabled users to run several disjoint crawls with different configs, but 
> still using the same storage medium, just under different paths.
> This is not possible now because there is a 1:1 mapping between a specific 
> DataStore instance and a set of crawl data.
> In order to support this functionality the Gora API should be extended so 
> that it can create stores (and data tables in the underlying storage) that 
> use arbitrary prefixes to identify the particular crawl dataset. Then the 
> Nutch API should be extended to allow passing this "crawlId" value to select 
> one of possibly many existing crawl datasets.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
