[
https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356711#comment-14356711
]
Markus Jelsma commented on NUTCH-1932:
--------------------------------------
Hm yes, I've thought about using a scoring filter too. However, we do need some
code in CrawlDbReducer.reduce() because in the end we want to completely remove
the record from the CrawlDB. A work-around, maybe not elegant but useful, would
be to expose the CrawlDatum to URL filtering and normalizing.
We have some other Nutch jobs that would benefit from having a method signature
like normalize(String url, CrawlDatum datum, String scope); the same is true for
filter.
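
To illustrate the idea, here is a minimal sketch of what a datum-aware filter
could look like. Datum, DatumAwareFilter and orphanFilter are hypothetical
names made up for this example, and the "gone" and "inlinks" fields are
assumptions; none of this is existing Nutch API. It only borrows the existing
URLFilter convention that returning null rejects the URL.

```java
// Sketch only: Datum and DatumAwareFilter are hypothetical stand-ins,
// not classes from the Nutch code base.
final class DatumAwareFilterSketch {

    /** Simplified stand-in for CrawlDatum with just the fields this sketch needs. */
    static final class Datum {
        final boolean gone;  // assumed flag: last fetch reported the page as gone (e.g. 404)
        final int inlinks;   // assumed field: number of known inlinks

        Datum(boolean gone, int inlinks) {
            this.gone = gone;
            this.inlinks = inlinks;
        }
    }

    /** Hypothetical filter variant that also sees the record; null means "drop". */
    interface DatumAwareFilter {
        String filter(String url, Datum datum);
    }

    /** Drops records that are gone and have no inlinks; keeps everything else. */
    static DatumAwareFilter orphanFilter() {
        return (url, datum) -> (datum.gone && datum.inlinks < 1) ? null : url;
    }

    public static void main(String[] args) {
        DatumAwareFilter filter = orphanFilter();
        // A gone page without inlinks is rejected, one with a single inlink is kept.
        System.out.println(filter.filter("http://example.org/old", new Datum(true, 0)));
        System.out.println(filter.filter("http://example.org/linked", new Datum(true, 1)));
    }
}
```

With something like this, a job that already runs URL filters could drop
orphaned records without extra logic of its own, although the final removal
from the CrawlDB would still need support in CrawlDbReducer.reduce().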
> Automatically remove orphaned pages
> -----------------------------------
>
> Key: NUTCH-1932
> URL: https://issues.apache.org/jira/browse/NUTCH-1932
> Project: Nutch
> Issue Type: New Feature
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.11
>
> Attachments: NUTCH-1932.patch
>
>
> Nutch should be able to automatically remove orphaned pages such as old
> 404's, and not continue to revisit them. This requires NUTCH-1913. An inlink
> count of 1 is enough.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)