[ 
https://issues.apache.org/jira/browse/NUTCH-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181936#comment-14181936
 ] 

Jonathan Cooper-Ellis commented on NUTCH-1872:
----------------------------------------------

The injected url is now set in injectedScore(), and propagated in 
passScoreBeforeParsing() and passScoreAfterParsing() when the rule is "prefix".

One thing I realized while doing this is that when the rule is "prefix", all 
other metadata should be propagated as well, for the same reason injectedUrl 
must be propagated (the case of a crawl that wanders away from the prefix and 
then returns). Otherwise the metadata is lost as soon as the crawl reaches a 
target that does not match the injectedUrl prefix. Does this sound right?

My solution is to propagate injectedUrl and all metadata to every target when 
the rule is "prefix", then identify non-matching urls in URLMetaIndexingFilter 
and skip indexing urlmeta.tags for them. The alternative would be to ignore 
the case of a "wandering crawl" and handle "prefix" in the same way as "host" 
or "domain" (with a check in URLMetaScoringFilter, leaving 
URLMetaIndexingFilter alone). What do you think?
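
To make the trade-off concrete, here is a rough standalone sketch of the two 
options. It is not the attached diff and does not use any Nutch classes: the 
class and method names are made up for illustration, the crawl is modelled as 
a simple chain of urls where each one is an outlink of the previous one, and 
"prefix" matching is assumed to be a plain String.startsWith() check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of a crawl path. Hypothetical names, not Nutch code. */
final class WanderingCrawlSketch {

  /** Assumed meaning of the "prefix" rule: the target extends the injected url. */
  static boolean matchesPrefix(String injectedUrl, String targetUrl) {
    return targetUrl.startsWith(injectedUrl);
  }

  /**
   * Option A: check while propagating. Once a target fails the check, the
   * metadata is dropped and nothing further along the path can receive it.
   */
  static List<String> filterWhilePropagating(String injectedUrl, List<String> path) {
    List<String> tagged = new ArrayList<>();
    boolean carrying = true; // does the current page still carry the metadata?
    for (String url : path) {
      carrying = carrying && matchesPrefix(injectedUrl, url);
      if (carrying) {
        tagged.add(url);
      }
    }
    return tagged;
  }

  /**
   * Option B: propagate to every target and check only when indexing, so a
   * url that returns to the prefix still gets its tags indexed.
   */
  static List<String> filterWhileIndexing(String injectedUrl, List<String> path) {
    List<String> tagged = new ArrayList<>();
    for (String url : path) {
      if (matchesPrefix(injectedUrl, url)) {
        tagged.add(url);
      }
    }
    return tagged;
  }

  public static void main(String[] args) {
    String injected = "http://example.com/docs/";
    List<String> path = Arrays.asList(
        "http://example.com/docs/intro",
        "http://example.com/about",     // wanders away from the prefix
        "http://example.com/docs/faq"); // and then returns
    System.out.println("A: " + filterWhilePropagating(injected, path));
    System.out.println("B: " + filterWhileIndexing(injected, path));
  }
}
```

With this path, option A tags only the first url, while option B tags the 
first and the third. The cost of option B is that metadata is carried on 
pages that will never have it indexed.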

Attaching a new diff that I believe is working as intended.

> enables control over how injected metadata is propagated
> --------------------------------------------------------
>
>                 Key: NUTCH-1872
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1872
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Jonathan Cooper-Ellis
>            Priority: Minor
>
> This builds on NUTCH-655 and NUTCH-855, allowing users some control over 
> which outlinks receive injected metadata. A new configuration property 
> "urlmeta.rule" has been introduced, with a default value of "all".
> The value "all" indicates that "urlmeta.tags" should be propagated to all 
> outlinks. Other options include: "host" (propagated to outlinks with the same 
> host as the url with which the metadata was injected), "domain" (same, except 
> with the same domain), "prefix" (treats the injected url as a prefix, so 
> metadata is only propagated to urls that extend the injected url).
> Would appreciate feedback on whether you think this is a useful feature, and 
> whether it is implemented properly.
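
For reference, the four rule values described in the quoted issue text could 
be checked roughly as follows. This is only a sketch under assumptions: the 
class and method names are hypothetical, hosts are read with java.net.URI, 
and "domain" is approximated by the last two labels of the host name, which 
is wrong for suffixes such as "co.uk" and is not how the patch necessarily 
does it.

```java
import java.net.URI;

/** Hypothetical rule check for the "urlmeta.rule" values; not Nutch code. */
final class UrlMetaRuleSketch {

  /** Naive stand-in for a real domain lookup: keeps the last two host labels. */
  static String naiveDomain(String host) {
    String[] labels = host.split("\\.");
    int n = labels.length;
    return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
  }

  /** Returns true when metadata injected with injectedUrl should reach outlink. */
  static boolean shouldPropagate(String rule, String injectedUrl, String outlink) {
    if ("all".equals(rule)) {
      return true; // default: every outlink receives the metadata
    }
    if ("prefix".equals(rule)) {
      return outlink.startsWith(injectedUrl); // outlink extends the injected url
    }
    String srcHost = URI.create(injectedUrl).getHost();
    String dstHost = URI.create(outlink).getHost();
    if (srcHost == null || dstHost == null) {
      return false; // no host to compare, so do not propagate
    }
    switch (rule) {
      case "host":
        return srcHost.equalsIgnoreCase(dstHost);
      case "domain":
        return naiveDomain(srcHost).equalsIgnoreCase(naiveDomain(dstHost));
      default:
        throw new IllegalArgumentException("unknown urlmeta.rule: " + rule);
    }
  }

  public static void main(String[] args) {
    String injected = "http://a.example.com/x/";
    System.out.println(shouldPropagate("host", injected, "http://b.example.com/y"));
    System.out.println(shouldPropagate("domain", injected, "http://b.example.com/y"));
    System.out.println(shouldPropagate("prefix", injected, "http://a.example.com/x/page"));
  }
}
```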



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
