[ 
https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14356711#comment-14356711
 ] 

Markus Jelsma commented on NUTCH-1932:
--------------------------------------

Hm, yes, I've thought about using a scoring filter too. However, we do need some
code in CrawlDbReducer.reduce() because in the end we want to completely remove
the record from the CrawlDB. A workaround, maybe not elegant but useful, would be
to pass the CrawlDatum to URL filtering and normalizing.
We have some other Nutch jobs that would benefit from having a method signature
like normalize(String url, CrawlDatum datum, String scope); the same is true for
filter.
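
To make the proposed signatures concrete, here is a minimal, self-contained
sketch in Java. It is not the attached patch and not existing Nutch API: the
Datum class is a hypothetical stand-in for CrawlDatum, the two interfaces and
the ORPHAN_FILTER rule are made up for illustration, and the "gone" flag and
inlink count are assumed to be available on the record. It only follows the
existing filter convention of returning the URL to keep it and null to drop it.

```java
class DatumAwareFilterSketch {

    /** Hypothetical stand-in for CrawlDatum, holding only what this sketch needs. */
    static final class Datum {
        final boolean gone;  // assumption: the last fetch ended in a 404 or similar
        final int inlinks;   // assumption: an inlink count is available on the record

        Datum(boolean gone, int inlinks) {
            this.gone = gone;
            this.inlinks = inlinks;
        }
    }

    /** Hypothetical datum-aware filter: returns the URL to keep it, null to drop it. */
    interface DatumAwareURLFilter {
        String filter(String url, Datum datum);
    }

    /** Hypothetical datum-aware normalizer, mirroring the signature in the comment. */
    interface DatumAwareURLNormalizer {
        String normalize(String url, Datum datum, String scope);
    }

    /** Drops records that are gone and have no inlinks; keeps everything else. */
    static final DatumAwareURLFilter ORPHAN_FILTER =
        (url, datum) -> (datum.gone && datum.inlinks < 1) ? null : url;

    /** Trivial normalizer that ignores the datum and returns the URL unchanged. */
    static final DatumAwareURLNormalizer PASS_THROUGH =
        (url, datum, scope) -> url;

    public static void main(String[] args) {
        // A gone page without inlinks is dropped (null); one inlink keeps it.
        System.out.println(ORPHAN_FILTER.filter("http://example.org/old", new Datum(true, 0)));
        System.out.println(ORPHAN_FILTER.filter("http://example.org/linked", new Datum(true, 1)));
    }
}
```

With something along these lines, a job that already runs URL filters over
CrawlDB records could drop orphans without a separate code path, although the
removal itself would still have to happen where records are written out.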

> Automatically remove orphaned pages
> -----------------------------------
>
>                 Key: NUTCH-1932
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1932
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.11
>
>         Attachments: NUTCH-1932.patch
>
>
> Nutch should be able to automatically remove orphaned pages such as old 
> 404's, and not continue to revisit them. This requires NUTCH-1913. An inlink 
> count of 1 is enough.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
