[jira] Commented: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

JIRA Thu, 16 Sep 2010 04:29:20 -0700

    [ https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910098#action_12910098 ]


Doğacan Güney commented on NUTCH-907:
-------------------------------------

Gora already supports this to some extent. When creating a data store, you can 
optionally specify a table name (the schemaName argument):

  public static <D extends DataStore<K,T>, K, T extends Persistent>
  D createDataStore(Class<D> dataStoreClass, Class<K> keyClass,
      Class<T> persistent, String schemaName)

We should be able to leverage that in Nutch to support separate crawl 
datasets. If we extend Nutch's current API so that a name can be specified for 
each crawl, Nutch can simply create tables prefixed with the crawl name, as 
Andrzej suggested. For example, a crawl dataset named "foo" would have a table 
called "foo_webtable".
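
A rough sketch of that naming convention, to make the idea concrete. The class 
CrawlTableNames and its schemaName method are hypothetical and exist in neither 
Nutch nor Gora; falling back to the unprefixed table when no crawl name is 
given is likewise only an assumption.

```java
import java.util.Objects;

/** Hypothetical helper, not part of Nutch or Gora. */
final class CrawlTableNames {
    private CrawlTableNames() {}

    /**
     * Builds the table name for a crawl by prefixing the base table
     * with the crawl name, e.g. ("foo", "webtable") -> "foo_webtable".
     * Assumption: a null or empty crawl name maps to the unprefixed table.
     */
    static String schemaName(String crawlId, String baseTable) {
        Objects.requireNonNull(baseTable, "baseTable");
        if (crawlId == null || crawlId.isEmpty()) {
            return baseTable;
        }
        return crawlId + "_" + baseTable;
    }

    public static void main(String[] args) {
        System.out.println(schemaName("foo", "webtable")); // prints "foo_webtable"
        System.out.println(schemaName("", "webtable"));    // prints "webtable"
    }
}
```

The resulting string would then be passed as the schemaName argument of 
createDataStore shown above.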

What do you think, Andrzej? I think Gora needs no extension here, but if people 
think the API is awkward we can change Gora too.

> DataStore API doesn't support multiple storage areas for multiple disjoint 
> crawls
> ---------------------------------------------------------------------------------
>
>                 Key: NUTCH-907
>                 URL: https://issues.apache.org/jira/browse/NUTCH-907
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Andrzej Bialecki 
>             Fix For: 2.0
>
>
> In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, 
> page data, linkdb, etc) by specifying a path where the data was stored. This 
> enabled users to run several disjoint crawls with different configs, but 
> still using the same storage medium, just under different paths.
> This is not possible now because there is a 1:1 mapping between a specific 
> DataStore instance and a set of crawl data.
> In order to support this functionality the Gora API should be extended so 
> that it can create stores (and data tables in the underlying storage) that 
> use arbitrary prefixes to identify the particular crawl dataset. Then the 
> Nutch API should be extended to allow passing this "crawlId" value to select 
> one of possibly many existing crawl datasets.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
