[ 
https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14745205#comment-14745205
 ] 

Sebastian Nagel commented on NUTCH-1932:
----------------------------------------

Hi Markus, that looks quite simple
- do we still need a new CrawlDatum status "db_orphan"? The ScoringFilter 
scoring-orphan sets the status to "db_gone", which should be sufficient in 
combination with "db.update.purge.404" (CrawlDb) and -deleteGone (indexer). 
Instead of "db.update.purge.orphans", a property to configure the time after 
which a page is considered orphaned could be useful. Btw., the 10 minutes (as 
in the last patch) are only meant for testing, right?
- the location of scoring-orphan should not be 
src/plugin/scoring-orphan/src/java/io/openindex/...
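
For illustration, the combination mentioned in the first point might look 
roughly like this in conf/nutch-site.xml. This is only a sketch: it assumes 
the stock "db.update.purge.404" property is what the patch would rely on, and 
it deliberately leaves out the proposed orphan-time property, since its name 
has not been decided here.

```xml
<!-- Sketch only: purge db_gone records from the CrawlDb during updatedb. -->
<!-- Pages marked "db_gone" by scoring-orphan would then be dropped too.  -->
<property>
  <name>db.update.purge.404</name>
  <value>true</value>
</property>
```

On the indexing side, passing -deleteGone to the indexer job would then 
remove the same pages from the index.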

> Automatically remove orphaned pages
> -----------------------------------
>
>                 Key: NUTCH-1932
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1932
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>         Attachments: NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, 
> NUTCH-1932.patch, NUTCH-1932.patch
>
>
> Nutch should be able to automatically remove orphaned pages such as old 
> 404's, and not continue to revisit them. This requires NUTCH-1913. An inlink 
> count of 1 is enough.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
