[ https://issues.apache.org/jira/browse/NUTCH-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169195#comment-14169195 ]
Sebastian Nagel commented on NUTCH-1872:
----------------------------------------
Hi Jonathan, you are right. Sorry :) For the prefix rule we need to track the
injected URL: the prefix is shared among all "ancestors", but we cannot
"reconstruct" the filter prefix from an ancestor URL alone. Since every
"ancestor" starts with that prefix, knowing the injected URL's length would be
enough to recover it. Host and domain rules, however, are reconstructible, so
we could do without the injected URL for these rules, right? Could be an
optimization to think about. Thanks!
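
A minimal sketch of the reconstruction idea above, not code from the attached patch. The class and method names are hypothetical, and it assumes metadata only ever reached URLs that already passed the rule:

```java
import java.net.URI;

// Hypothetical helper, for illustration only.
class FilterReconstructionSketch {

    // Host rule: every URL that received the metadata shares the injected
    // URL's host, so the host of the current URL is already the filter.
    static String hostFilterFrom(String currentUrl) {
        return URI.create(currentUrl).getHost();
    }

    // Prefix rule: every URL that received the metadata starts with the
    // injected URL, so its length alone is enough to recover it.
    static String prefixFilterFrom(String currentUrl, int injectedLength) {
        if (injectedLength < 0 || injectedLength > currentUrl.length()) {
            throw new IllegalArgumentException("length out of range");
        }
        return currentUrl.substring(0, injectedLength);
    }
}
```

Under that assumption, only an integer would need to be tracked for the prefix rule, and nothing extra for the host rule.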
> enables control over how injected metadata is propagated
> --------------------------------------------------------
>
> Key: NUTCH-1872
> URL: https://issues.apache.org/jira/browse/NUTCH-1872
> Project: Nutch
> Issue Type: New Feature
> Reporter: Jonathan Cooper-Ellis
> Priority: Minor
> Attachments: urlmeta_propagation.diff
>
>
> This builds on NUTCH-655 and NUTCH-855, allowing users some control over
> which outlinks receive injected metadata. A new configuration property
> "urlmeta.rule" has been introduced, with a default value of "all".
> The value "all" indicates that "urlmeta.tags" should be propagated to all
> outlinks. Other options include: "host" (propagated to outlinks with the same
> host as the url with which the metadata was injected), "domain" (same, except
> with the same domain), "prefix" (treats the injected url as a prefix, so
> metadata is only propagated to urls that extend the injected url).
> Would appreciate feedback on whether you think this is a useful feature, and
> whether it's implemented properly.
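
The four rule values described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the logic of urlmeta_propagation.diff: the class, enum, and method names are hypothetical, and the "domain" here is naively taken as the last two host labels, which is wrong for suffixes such as co.uk.

```java
import java.net.URI;

// Hypothetical sketch of the "urlmeta.rule" values; names are illustrative.
class UrlMetaRuleSketch {

    enum Rule { ALL, HOST, DOMAIN, PREFIX }

    // Naive domain: the last two labels of the host (an assumption; a real
    // implementation would need public-suffix-aware handling).
    static String naiveDomain(String host) {
        String[] parts = host.split("\\.");
        int n = parts.length;
        return n <= 2 ? host : parts[n - 2] + "." + parts[n - 1];
    }

    // Returns true if metadata injected with injectedUrl should be
    // propagated to outlink under the given rule.
    static boolean shouldPropagate(Rule rule, String injectedUrl, String outlink) {
        String injectedHost = URI.create(injectedUrl).getHost();
        String outlinkHost = URI.create(outlink).getHost();
        switch (rule) {
            case ALL:
                return true;
            case HOST:
                return injectedHost != null
                    && injectedHost.equalsIgnoreCase(outlinkHost);
            case DOMAIN:
                return injectedHost != null && outlinkHost != null
                    && naiveDomain(injectedHost)
                        .equalsIgnoreCase(naiveDomain(outlinkHost));
            case PREFIX:
                return outlink.startsWith(injectedUrl);
            default:
                return false;
        }
    }
}
```

For example, with the injected URL http://www.example.com/docs/, the outlink http://blog.example.com/post would pass the "domain" rule in this sketch but fail both "host" and "prefix".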
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)