[
https://issues.apache.org/jira/browse/NUTCH-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168776#comment-14168776
]
Sebastian Nagel commented on NUTCH-1872:
----------------------------------------
Thanks, [~jcoopere]! Nice patch!
A question about {{injectedUrl}} in distributeScoreToOutlinks(): do we really
need to track and keep the injected URL? The prefix/host/domain (depending on
the chosen rule) of the injected URL and of all target URLs is the same. This
relation is transitive and is "inherited" by all "descendants" of one injected
URL.
About {{e.printStackTrace();}} in case of a MalformedURLException: URLs should
be valid at this point, so we could ignore the exception or log a warning, but
we don't need the stack trace.
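
A minimal sketch of the warning-only handling suggested above. The class
HostLookupSketch and the method hostOrNull are hypothetical names, not part of
the patch, and java.util.logging is used only to keep the example
self-contained (an assumption; the plugin's own logger would be used in
practice):

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.logging.Logger;

// Hypothetical helper, not taken from the attached patch.
final class HostLookupSketch {
  private static final Logger LOG =
      Logger.getLogger(HostLookupSketch.class.getName());

  /** Returns the host of the given URL, or null if it cannot be parsed. */
  static String hostOrNull(String url) {
    try {
      return new URL(url).getHost();
    } catch (MalformedURLException e) {
      // One-line warning with the exception message only,
      // instead of e.printStackTrace().
      LOG.warning("Skipping malformed URL " + url + ": " + e.getMessage());
      return null;
    }
  }
}
```

The caller can then skip an outlink whose host is null rather than abort the
whole page.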
> enables control over how injected metadata is propagated
> --------------------------------------------------------
>
> Key: NUTCH-1872
> URL: https://issues.apache.org/jira/browse/NUTCH-1872
> Project: Nutch
> Issue Type: New Feature
> Reporter: Jonathan Cooper-Ellis
> Priority: Minor
> Attachments: urlmeta_propagation.diff
>
>
> This builds on NUTCH-655 and NUTCH-855, allowing users some control over
> which outlinks receive injected metadata. A new configuration property
> "urlmeta.rule" has been introduced, with a default value of "all".
> The value "all" indicates that "urlmeta.tags" should be propagated to all
> outlinks. Other options include: "host" (propagated to outlinks with the same
> host as the url with which the metadata was injected), "domain" (same, except
> with the same domain), "prefix" (treats the injected url as a prefix, so
> metadata is only propagated to urls that extend the injected url).
> Would appreciate feedback on whether you think this is a useful feature, and
> whether it's implemented properly.
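
For illustration, the four "urlmeta.rule" values described in the issue could
be checked roughly as follows. This is a sketch under stated assumptions, not
the code from urlmeta_propagation.diff: the class UrlMetaRuleSketch and its
methods are hypothetical, and the "domain" rule uses a naive
last-two-host-labels heuristic that ignores public suffixes such as "co.uk".

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical illustration of the rule semantics, not the patch itself.
final class UrlMetaRuleSketch {

  /** True if metadata injected with injectedUrl should reach the outlink. */
  static boolean shouldPropagate(String rule, String injectedUrl,
      String outlink) {
    try {
      switch (rule) {
        case "all":
          return true;
        case "prefix":
          // The outlink must extend the injected URL.
          return outlink.startsWith(injectedUrl);
        case "host":
          return new URL(injectedUrl).getHost()
              .equalsIgnoreCase(new URL(outlink).getHost());
        case "domain":
          return naiveDomain(injectedUrl).equals(naiveDomain(outlink));
        default:
          throw new IllegalArgumentException("unknown rule: " + rule);
      }
    } catch (MalformedURLException e) {
      // Assumption: an unparsable URL simply receives no metadata.
      return false;
    }
  }

  /** Naive assumption: the domain is the last two labels of the host. */
  static String naiveDomain(String url) throws MalformedURLException {
    String host = new URL(url).getHost().toLowerCase();
    String[] labels = host.split("\\.");
    int n = labels.length;
    return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
  }
}
```

With the injected URL "http://www.example.com/a/", the outlink
"http://blog.example.com/z" would match under "domain" but not under "host"
or "prefix".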