[
https://issues.apache.org/jira/browse/NUTCH-1757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004520#comment-14004520
]
Julien Nioche commented on NUTCH-1757:
--------------------------------------
Hi Markus
bq. metadata is passed via CrawlDatum, yet a CrawlDatum never makes it to a
parser filter nor a parser implementation, what am i missing?
You haven't missed anything, I had! One of the key elements was the call to the
ScoringFilters.
This patch actually does 2 things:
* it passes the metadata from the command line to the fetch step, i.e. the
protocol implementations should be able to use the metadata
* it calls the scfilters.passScoreBeforeParsing and
scfilters.passScoreAfterParsing methods
A typical use of this is with the urlmeta plugin, which takes the urlmeta.tags
parameter to pass the corresponding K/V to the outlinks. It does that with a
ScoringFilter (URLMetaScoringFilter), which copies the K/V into the Content
object in the passScoreBeforeParsing method and also copies the K/V straight
into the parse data with passScoreAfterParsing.
If you activate the urlmeta plugin and set a value for urlmeta.tags, your parse
filters should be able to retrieve the K/V from the Content object.
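
To make the flow concrete, here is a minimal sketch of the two copy steps.
It is a simplified model and not the Nutch API: plain maps stand in for the
metadata of the CrawlDatum, Content and parse data, and the class
UrlMetaFlowSketch, the method copyTags and the tag name "lang" are all
hypothetical. It also assumes that the after-parsing step reads from the
content metadata.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified model of the urlmeta flow; plain maps replace the Nutch objects. */
public class UrlMetaFlowSketch {

    /** Copies the values of the listed tags from one metadata map to another. */
    public static void copyTags(String[] tags,
                                Map<String, String> from,
                                Map<String, String> to) {
        for (String tag : tags) {
            String value = from.get(tag);
            if (value != null) {
                to.put(tag, value);
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical value for urlmeta.tags.
        String[] tags = {"lang"};

        // Metadata as it would arrive with the datum (e.g. from the command line).
        Map<String, String> datumMeta = new HashMap<>();
        datumMeta.put("lang", "en");
        datumMeta.put("other", "not listed in tags, so not copied");

        Map<String, String> contentMeta = new HashMap<>();
        Map<String, String> parseMeta = new HashMap<>();

        // Step modelled on passScoreBeforeParsing: datum -> content.
        copyTags(tags, datumMeta, contentMeta);

        // A parse filter would read contentMeta at this point.

        // Step modelled on passScoreAfterParsing (assumption: content -> parse data).
        copyTags(tags, contentMeta, parseMeta);

        System.out.println("content metadata: " + contentMeta);
        System.out.println("parse metadata:   " + parseMeta);
    }
}
```

Only the keys listed in the tags array travel along, which is why nothing
reaches the Content object unless urlmeta.tags is set.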
I am not particularly happy with the way we do things with that urlmeta plugin,
and what it does should be part of the core code, but that's a different topic.
> ParserChecker to take custom metadata as input
> ----------------------------------------------
>
> Key: NUTCH-1757
> URL: https://issues.apache.org/jira/browse/NUTCH-1757
> Project: Nutch
> Issue Type: Improvement
> Components: nutchNewbie, parser
> Affects Versions: 1.8
> Reporter: Julien Nioche
> Priority: Minor
> Fix For: 1.9
>
> Attachments: NUTCH-1757.patch, NUTCH-1757.patch.v2
>
>
> The attached patch allows custom metadata to be passed on the command line
> (-md key=value) to the ParserChecker. This gives a behaviour similar to
> injecting metadata via the seed files. Some custom parser implementations can
> rely on such metadata, which is why the ParserChecker must allow it to be
> passed.
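
For illustration, the -md key=value form could be collected into a map along
these lines. This is only a sketch and not the code from the patch: the class
MetadataArgsSketch and the method parseMetadataArgs are hypothetical, and the
choice to silently skip malformed pairs is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not taken from the patch. */
public class MetadataArgsSketch {

    /** Collects every "-md key=value" pair found in the arguments. */
    public static Map<String, String> parseMetadataArgs(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i++) {
            if ("-md".equals(args[i]) && i + 1 < args.length) {
                String pair = args[++i];
                int eq = pair.indexOf('=');
                // Assumption: pairs without a key or without '=' are skipped.
                if (eq > 0) {
                    metadata.put(pair.substring(0, eq), pair.substring(eq + 1));
                }
            }
        }
        return metadata;
    }

    public static void main(String[] args) {
        String[] example = {"-md", "lang=en", "-md", "section=news", "http://example.com/"};
        System.out.println(parseMetadataArgs(example));
    }
}
```

Splitting on the first '=' only means a value may itself contain '='.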