[ https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916870#action_12916870 ]
Andrzej Bialecki commented on NUTCH-907:
-----------------------------------------

Hi Sertan,

Thanks for the patch, this looks very good! A few comments:

* I'm not good at naming things either... schemaId is a little bit cryptic though. If we didn't already use crawlId I would vote for that (and then rename crawlId to batchId or fetchId), as it is now... I don't know, maybe datasetId ..

* since we now create multiple datasets, we need somehow to manage them - i.e. list and delete at least (create is implicit). There is no such functionality in this patch, but this can be addressed also as a separate issue.

* IndexerMapReduce.createIndexJob: I think it would be useful to pass the "datasetId" as a Job property - this way indexing filter plugins can use this property to populate NutchDocument fields if needed. FWIW, this may be a good idea to do in other jobs as well...

> DataStore API doesn't support multiple storage areas for multiple disjoint crawls
> ---------------------------------------------------------------------------------
>
>                 Key: NUTCH-907
>                 URL: https://issues.apache.org/jira/browse/NUTCH-907
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Andrzej Bialecki
>             Fix For: 2.0
>
>         Attachments: NUTCH-907.patch
>
>
> In Nutch 1.x it was possible to easily select a set of crawl data (crawldb,
> page data, linkdb, etc) by specifying a path where the data was stored. This
> enabled users to run several disjoint crawls with different configs, but
> still using the same storage medium, just under different paths.
> This is not possible now because there is a 1:1 mapping between a specific
> DataStore instance and a set of crawl data.
> In order to support this functionality the Gora API should be extended so
> that it can create stores (and data tables in the underlying storage) that
> use arbitrary prefixes to identify the particular crawl dataset. Then the
> Nutch API should be extended to allow passing this "crawlId" value to select
> one of possibly many existing crawl datasets.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
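
For illustration only, the createIndexJob suggestion in the comment above could look roughly like the sketch below. It is not taken from NUTCH-907.patch and does not use the real Hadoop or Nutch classes: a plain Map stands in for the Job configuration and for NutchDocument, and the property key "example.dataset.id", the class name and both method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class DatasetIdSketch {
    // Hypothetical property key; the actual name would be settled in the patch.
    static final String DATASET_ID_KEY = "example.dataset.id";

    // Stand-in for job setup such as IndexerMapReduce.createIndexJob:
    // records the dataset id in the job's properties when one is given.
    static Map<String, String> createJobProperties(String datasetId) {
        Map<String, String> props = new HashMap<>();
        if (datasetId != null && !datasetId.isEmpty()) {
            props.put(DATASET_ID_KEY, datasetId);
        }
        return props;
    }

    // Stand-in for an indexing filter plugin: if the job carries a dataset id,
    // copy it into a field of the document (a Map here, not a NutchDocument).
    static Map<String, String> filter(Map<String, String> doc,
                                      Map<String, String> jobProps) {
        String datasetId = jobProps.get(DATASET_ID_KEY);
        if (datasetId != null) {
            doc.put("datasetId", datasetId);
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> props = createJobProperties("crawlA");
        Map<String, String> doc = filter(new HashMap<>(), props);
        System.out.println(doc.get("datasetId")); // prints "crawlA"
    }
}
```

With this shape the filter stays independent of how the id was chosen, and a job started without a dataset id leaves the document untouched.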