[jira] Commented: (NUTCH-655) Injecting Crawl metadata

Julien Nioche (JIRA) Wed, 06 Jan 2010 09:02:27 -0800

    [ https://issues.apache.org/jira/browse/NUTCH-655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797176#action_12797176 ]


Julien Nioche commented on NUTCH-655:
-------------------------------------

Good idea. I've made the modification and documented it in the Javadoc:

The URL files contain one URL per line, optionally followed by custom metadata 
fields separated by tabs, with each metadata key separated from its value 
by '='. 
Note that some metadata keys are reserved: 
- <i>nutch.score</i> : sets a custom score for a specific URL <br>
- <i>nutch.fetchInterval</i> : sets a custom fetch interval for a 
specific URL <br>
e.g. http://www.nutch.org/ \t nutch.score=10 \t nutch.fetchInterval=2592000 \t userType=open_source
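
To make the line format concrete, here is a minimal sketch of how such a line could be split into a URL, the two reserved values, and the remaining custom metadata. It is not the code from the attached patch: the class SeedLineParser, the nested Seed type, and the parse method are hypothetical names used only for illustration, and the choice to silently skip fields without an '=' is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: parses one line of a URL file
// in the "URL <tab> key=value <tab> key=value ..." format described above.
final class SeedLineParser {

    // Reserved metadata keys named in the Javadoc.
    static final String SCORE_KEY = "nutch.score";
    static final String FETCH_INTERVAL_KEY = "nutch.fetchInterval";

    // Simple holder for the parsed result (illustrative only).
    static final class Seed {
        final String url;
        Float score;            // null when the line does not set nutch.score
        Integer fetchInterval;  // null when the line does not set nutch.fetchInterval
        final Map<String, String> metadata = new LinkedHashMap<>();

        Seed(String url) {
            this.url = url;
        }
    }

    static Seed parse(String line) {
        // Fields are tab-separated; the URL is always the first field.
        String[] fields = line.split("\t");
        Seed seed = new Seed(fields[0].trim());

        for (int i = 1; i < fields.length; i++) {
            String field = fields[i].trim();
            int eq = field.indexOf('=');
            if (eq <= 0) {
                continue; // assumption: ignore fields with no key or no '='
            }
            String key = field.substring(0, eq).trim();
            String value = field.substring(eq + 1).trim();

            if (SCORE_KEY.equals(key)) {
                seed.score = Float.parseFloat(value);
            } else if (FETCH_INTERVAL_KEY.equals(key)) {
                seed.fetchInterval = Integer.parseInt(value);
            } else {
                seed.metadata.put(key, value); // any other key is custom metadata
            }
        }
        return seed;
    }

    public static void main(String[] args) {
        Seed seed = parse("http://www.nutch.org/\tnutch.score=10"
                + "\tnutch.fetchInterval=2592000\tuserType=open_source");
        System.out.println(seed.url + " " + seed.score + " "
                + seed.fetchInterval + " " + seed.metadata);
    }
}
```

In this sketch the reserved keys are kept out of the custom metadata map, so only userType would be stored as metadata for the example line; whether the actual Injector does the same is not stated here.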
 

> Injecting Crawl metadata
> ------------------------
>
>                 Key: NUTCH-655
>                 URL: https://issues.apache.org/jira/browse/NUTCH-655
>             Project: Nutch
>          Issue Type: Improvement
>          Components: injector
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>            Priority: Minor
>             Fix For: 1.1
>
>         Attachments: Injector.patch, NUTCH-655.v2
>
>
> The attached patch allows metadata to be injected into the crawlDB. The input 
> file must contain fields separated by tabs, with the URL in the first 
> column. Metadata names and values are separated by '='. An input line 
> might look like this:
> http://www.myurl.com  \t  categ=value1 \t categ2=value2
> This functionality can be useful for storing external knowledge and indexing 
> it with a custom plugin.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
