[ https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sertan Alkan updated NUTCH-907:
-------------------------------
    Attachment: NUTCH-907.patch

Here's a patch that allows Nutch to create different schemas based on the same schema definition. Some points about the patch:

* To prefix a schema name with a value, Nutch needs to know the default schema name defined in the Gora mapping file (e.g. ...table=<name>...). Gora currently handles schema creation internally and doesn't expose this name to the outside. So the patch introduces two new configuration options to pass the schema name to Nutch internals.
** Nutch *ignores* the schema name setting in the Gora mapping file; instead, the configuration option {{storage.schema}} tells Nutch which schema name to use to access the data store. This value defaults to _webpage_.
** The {{storage.schema.id}} option defines the prefix to add to the schema name given in {{storage.schema}}. By default this id is not provided, i.e. all jobs run on the _webpage_ store as before.
* Apart from the configuration option, all jobs (injector, generator, fetcher, updatedb, indexer, benchmark and webtable reader) are modified to accept a schema id as an optional command line argument, {{-schemaId}}, which overrides the configuration option ({{-schemaId}} may seem an odd name, but I am not big on naming things).
* The patch also modifies the unit tests to use the same logic.

All unit tests pass without a problem. I have run a simple crawl a) with the default configuration, b) with a schema id provided from the configuration, and c) with the ids given on the command line, and the jobs seem to run well.
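
To make the lookup order described above concrete, here is a minimal, standalone sketch. It is not the code from the patch: the class name {{SchemaNameSketch}}, its method names, the plain {{Map}} standing in for the Nutch/Hadoop configuration, and the underscore separator between the id and the schema name are all assumptions made for illustration. Only the option names {{storage.schema}} and {{storage.schema.id}}, the _webpage_ default, and the {{-schemaId}} argument come from the description.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; names and the separator are hypothetical, not taken from the patch.
public class SchemaNameSketch {
    static final String SCHEMA_KEY = "storage.schema";       // option named in the patch notes
    static final String SCHEMA_ID_KEY = "storage.schema.id"; // option named in the patch notes
    static final String DEFAULT_SCHEMA = "webpage";          // default named in the patch notes
    static final String SEPARATOR = "_";                     // ASSUMPTION: the notes do not say how the prefix is joined

    // Returns the value following "-schemaId" in args, or null if the flag is absent.
    static String schemaIdFromArgs(String[] args) {
        for (int i = 0; i < args.length - 1; i++) {
            if ("-schemaId".equals(args[i])) {
                return args[i + 1];
            }
        }
        return null;
    }

    // Resolves the effective schema name: the command-line id overrides the configured id,
    // and with no id at all the plain schema name is used, as before.
    static String resolveSchemaName(Map<String, String> conf, String[] args) {
        String schema = conf.getOrDefault(SCHEMA_KEY, DEFAULT_SCHEMA);
        String id = schemaIdFromArgs(args);
        if (id == null) {
            id = conf.get(SCHEMA_ID_KEY);
        }
        if (id == null || id.isEmpty()) {
            return schema;
        }
        return id + SEPARATOR + schema;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // a) default configuration
        System.out.println(resolveSchemaName(conf, new String[0]));
        // b) schema id from configuration
        conf.put(SCHEMA_ID_KEY, "crawlA");
        System.out.println(resolveSchemaName(conf, new String[0]));
        // c) schema id from the command line overrides the configured one
        System.out.println(resolveSchemaName(conf, new String[] {"-schemaId", "crawlB"}));
    }
}
```

The three calls in {{main}} mirror the three crawl scenarios mentioned above; the ids {{crawlA}} and {{crawlB}} are made-up values.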

> DataStore API doesn't support multiple storage areas for multiple disjoint crawls
> ---------------------------------------------------------------------------------
>
>                 Key: NUTCH-907
>                 URL: https://issues.apache.org/jira/browse/NUTCH-907
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Andrzej Bialecki
>             Fix For: 2.0
>
>         Attachments: NUTCH-907.patch
>
>
> In Nutch 1.x it was possible to easily select a set of crawl data (crawldb,
> page data, linkdb, etc) by specifying a path where the data was stored. This
> enabled users to run several disjoint crawls with different configs, but
> still using the same storage medium, just under different paths.
> This is not possible now because there is a 1:1 mapping between a specific
> DataStore instance and a set of crawl data.
> In order to support this functionality the Gora API should be extended so
> that it can create stores (and data tables in the underlying storage) that
> use arbitrary prefixes to identify the particular crawl dataset. Then the
> Nutch API should be extended to allow passing this "crawlId" value to select
> one of possibly many existing crawl datasets.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.