[ 
https://issues.apache.org/jira/browse/NUTCH-1757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004520#comment-14004520
 ] 

Julien Nioche commented on NUTCH-1757:
--------------------------------------

Hi Markus

bq. metadata is passed via CrawlDatum, yet a CrawlDatum never makes it to a 
parser filter nor a parser implementation, what am i missing?

You haven't missed anything, I had! One of the key elements was the call to the 
ScoringFilters.

This patch actually does two things:
* it passes the metadata from the command line to the fetch step, i.e. the 
protocol implementations should be able to use the metadata
* it calls the scfilters.passScoreBeforeParsing and 
scfilters.passScoreAfterParsing methods
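
To make the first point concrete, here is a minimal sketch of how repeated -md key=value options could be collected into a metadata map. It is written in Python with plain dicts purely for illustration; parse_md_args is a hypothetical helper and is not the code from the patch.

```python
def parse_md_args(argv):
    """Collect every '-md key=value' pair found in argv into a dict.

    Hypothetical helper for illustration only; not the ParserChecker code.
    """
    metadata = {}
    i = 0
    while i < len(argv):
        if argv[i] == "-md" and i + 1 < len(argv):
            # Split on the first '=' only, so values may contain '='.
            key, sep, value = argv[i + 1].partition("=")
            if sep and key:
                metadata[key] = value
            i += 2
        else:
            i += 1
    return metadata


# Example arguments (made up): two metadata pairs followed by a URL.
args = ["-md", "lang=en", "-md", "source=seed", "http://example.com/"]
md = parse_md_args(args)
```

The resulting map is what would then be attached to the CrawlDatum before the fetch step, so that protocol implementations can see it.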

A typical use of this is with the urlmeta plugin, which takes the urlmeta.tags 
parameter to pass the corresponding K/V to the outlinks. It does that with a 
ScoringFilter (URLMetaScoringFilter), which copies the K/V into the Content 
object in the passScoreBeforeParsing method and also copies the K/V straight 
into the parse data with passScoreAfterParsing.

If you activate the urlmeta plugin and set a value for urlmeta.tags, your parse 
filters should be able to retrieve the K/V from the Content object.
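
The flow described above can be sketched as follows. This is a simplified model in Python using plain dicts in place of the CrawlDatum, Content and parse data metadata; the function names mirror the ScoringFilter methods, but the signatures and the example_parse_filter helper are assumptions made for illustration, not the Nutch API.

```python
def pass_score_before_parsing(tags, datum_meta, content_meta):
    """Copy the configured tags from the datum metadata to the content metadata."""
    for tag in tags:
        if tag in datum_meta:
            content_meta[tag] = datum_meta[tag]


def pass_score_after_parsing(tags, content_meta, parse_meta):
    """Copy the configured tags from the content metadata to the parse metadata."""
    for tag in tags:
        if tag in content_meta:
            parse_meta[tag] = content_meta[tag]


def example_parse_filter(content_meta):
    """Hypothetical parse filter that reads one K/V from the content metadata."""
    return content_meta.get("lang")


# 'tags' stands in for the value of urlmeta.tags; the metadata is made up.
tags = ["lang"]
datum_meta = {"lang": "en", "other": "not listed in tags"}
content_meta = {}
parse_meta = {}

pass_score_before_parsing(tags, datum_meta, content_meta)  # before parsing
seen_by_filter = example_parse_filter(content_meta)        # during parsing
pass_score_after_parsing(tags, content_meta, parse_meta)   # after parsing
```

Only the keys listed in tags are propagated, which is why a value must be set for urlmeta.tags before a parse filter can see the metadata.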

I am not particularly happy with the way we do things with that urlmeta plugin; 
what it does should be part of the core code, but that's a different topic. 





 

> ParserChecker to take custom metadata as input
> ----------------------------------------------
>
>                 Key: NUTCH-1757
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1757
>             Project: Nutch
>          Issue Type: Improvement
>          Components: nutchNewbie, parser
>    Affects Versions: 1.8
>            Reporter: Julien Nioche
>            Priority: Minor
>             Fix For: 1.9
>
>         Attachments: NUTCH-1757.patch, NUTCH-1757.patch.v2
>
>
> The attached patch allows custom metadata to be passed on the command line 
> (-md key=value) to the ParserChecker. This gives behaviour similar to 
> injecting metadata via the seed files. Some custom parser implementations can 
> rely on such metadata, which is why the ParserChecker must allow it to be 
> passed. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
