[ 
https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14745216#comment-14745216
 ] 

Markus Jelsma commented on NUTCH-1932:
--------------------------------------

Hey Sebastian, I fixed the location; it is all org.apache now. Regarding the new 
status: a record must be marked as GONE first, so that a later indexer job can 
remove it from the index. If I use purge.404, the CrawlDbReducer will delete the 
record before an indexer has the chance to remove it. This new patch marks a 
record as GONE after a specified time so the indexer can delete it. Later, at 
another specified time, it is marked as ORPHAN, at which point it is deleted 
from the CrawlDB.

The tests do not pass yet.
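
For illustration only, the two-stage lifecycle described above could be sketched 
roughly as follows. This is not the code in the attached patch: the class name, 
the enum, the method names and the two threshold parameters are all 
hypothetical, and measuring age from the last time an inlink was seen is an 
assumption. The sketch only allows ORPHAN to be reached from GONE, so the 
indexer always gets a pass over the record before it leaves the CrawlDB.

```java
// Illustrative sketch only; none of these names come from Nutch or the patch.
final class OrphanLifecycleSketch {

    // Simplified stand-in for CrawlDb statuses (hypothetical).
    enum Status { FETCHED, GONE, ORPHAN }

    /**
     * Decide the next status of a record.
     *
     * @param current          status currently stored for the record
     * @param lastInlinkSeenMs assumed timestamp of the last observed inlink
     * @param nowMs            current time
     * @param goneAfterMs      hypothetical threshold for marking GONE
     * @param orphanAfterMs    hypothetical threshold for marking ORPHAN
     */
    static Status nextStatus(Status current, long lastInlinkSeenMs, long nowMs,
                             long goneAfterMs, long orphanAfterMs) {
        long age = nowMs - lastInlinkSeenMs;
        // ORPHAN is only reachable from GONE, so an indexer job has had
        // the chance to see the GONE record and delete the document.
        if (current == Status.GONE && age >= orphanAfterMs) {
            return Status.ORPHAN;
        }
        if (current != Status.ORPHAN && age >= goneAfterMs) {
            return Status.GONE;
        }
        return current;
    }

    // The indexer removes documents whose record is GONE.
    static boolean indexerShouldDelete(Status s) {
        return s == Status.GONE;
    }

    // The CrawlDb update drops records that reached ORPHAN.
    static boolean crawlDbShouldDrop(Status s) {
        return s == Status.ORPHAN;
    }

    public static void main(String[] args) {
        long day = 24L * 60 * 60 * 1000;
        Status s = Status.FETCHED;
        // Walk one record through three update cycles, 20 days apart.
        for (long now = 20 * day; now <= 60 * day; now += 20 * day) {
            s = nextStatus(s, 0L, now, 30 * day, 40 * day);
            System.out.println("day " + (now / day) + ": " + s);
        }
    }
}
```

With thresholds of 30 and 40 days, a record that is already 50 days past its 
last inlink when first examined is still marked GONE on that update and only 
becomes ORPHAN on the next one.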

> Automatically remove orphaned pages
> -----------------------------------
>
>                 Key: NUTCH-1932
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1932
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>         Attachments: NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, 
> NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch
>
>
> Nutch should be able to automatically remove orphaned pages such as old 
> 404's, and not continue to revisit them. This requires NUTCH-1913. An inlink 
> count of 1 is enough.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
