[ https://issues.apache.org/jira/browse/NUTCH-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151537#comment-14151537 ]
Julien Nioche commented on NUTCH-1403:
--------------------------------------
You are right, the urlmeta plugin currently does not distinguish between the
different levels and uses a single config 'urlmeta.tags'. Having 3 separate
configs is probably a good idea.
I am not sure the 4th one you suggested is really necessary ("a list of meta data to
pass from crawl datum to content to parse data"), as it is equivalent to listing the
same values in the first 2 configs. Adding it would probably cause some confusion.
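
To illustrate the point about the 4th config, here is a minimal sketch of
per-level key lists. It is not the urlmeta plugin's code: the property names
(example.md.*) and the class are hypothetical, plain maps stand in for the
Nutch metadata objects, and each target map starts empty for simplicity.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class MetadataPassingSketch {

    // Hypothetical property names, one per level (not actual Nutch config keys).
    public static final String DB_TO_CONTENT = "example.md.db2content";
    public static final String CONTENT_TO_PARSE = "example.md.content2parse";
    public static final String PARSE_TO_OUTLINK = "example.md.parse2outlink";

    /** Reads a comma-separated list of metadata keys; an unset property yields no keys. */
    public static Set<String> keys(Map<String, String> conf, String property) {
        Set<String> result = new LinkedHashSet<>();
        for (String key : conf.getOrDefault(property, "").split(",")) {
            String trimmed = key.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    /** Copies only the allowed keys from the source metadata into a new map. */
    public static Map<String, String> pass(Map<String, String> source, Set<String> allowed) {
        Map<String, String> target = new HashMap<>();
        for (String key : allowed) {
            String value = source.get(key);
            if (value != null) {
                target.put(key, value);
            }
        }
        return target;
    }

    /** Crawl metadata -> content metadata -> parse metadata. */
    public static Map<String, String> parseMetadata(Map<String, String> conf,
                                                    Map<String, String> crawlMeta) {
        Map<String, String> content = pass(crawlMeta, keys(conf, DB_TO_CONTENT));
        return pass(content, keys(conf, CONTENT_TO_PARSE));
    }

    /** Parse metadata -> metadata attached to the crawl datum of each outlink. */
    public static Map<String, String> outlinkMetadata(Map<String, String> conf,
                                                      Map<String, String> crawlMeta) {
        return pass(parseMetadata(conf, crawlMeta), keys(conf, PARSE_TO_OUTLINK));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(DB_TO_CONTENT, "lang, depth");
        conf.put(CONTENT_TO_PARSE, "lang");
        // PARSE_TO_OUTLINK is left unset, so nothing reaches the outlinks.

        Map<String, String> crawlMeta = new HashMap<>();
        crawlMeta.put("lang", "en");
        crawlMeta.put("depth", "2");
        crawlMeta.put("seed", "true");

        System.out.println("parse metadata:   " + parseMetadata(conf, crawlMeta));
        System.out.println("outlink metadata: " + outlinkMetadata(conf, crawlMeta));
    }
}
```

In this sketch a key reaches the parse metadata only when it appears in both
of the first 2 lists, which is the effect a combined "crawl datum to content
to parse data" list would have.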
Thanks
> Add default ScoringFilter for manipulating metadata
> ----------------------------------------------------
>
> Key: NUTCH-1403
> URL: https://issues.apache.org/jira/browse/NUTCH-1403
> Project: Nutch
> Issue Type: Improvement
> Reporter: Julien Nioche
>
> This is currently done by the urlmeta plugin, which has too vague a name and
> a redundant indexing filter now that we have the index-metadata plugin. This
> scoring filter would help define which metadata to pass from:
> - the crawl metadata to the content metadata
> - the content metadata to the parse metadata
> - the parse metadata to the crawldatum for the outlinks
> I'd make this scoring filter available by default, i.e. not in a separate
> plugin, as its functionality is commonly used.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)