[
https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14745205#comment-14745205
]
Sebastian Nagel commented on NUTCH-1932:
----------------------------------------
Hi Markus, that looks quite simple.
- do we still need a new CrawlDatum status "db_orphan"? The ScoringFilter
scoring-orphan sets the status to "db_gone", which should be sufficient in
combination with "db.update.purge.404" (CrawlDb) and -deleteGone (indexer).
Instead of "db.update.purge.orphans", a property to configure the time after
which a page is considered orphaned could be useful. Btw., the 10 minutes (as
in the last patch) are only for testing, right?
- the location of scoring-orphan should not be
src/plugin/scoring-orphan/src/java/io/openindex/...
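
To illustrate the configurable time from the first point, here is a minimal
sketch. It is not the code from the patch: the class, the property name
"example.orphan.after.seconds" and the 30-day default are assumptions made up
for this example, and it deliberately uses no Nutch classes. It only shows the
decision itself, i.e. a page counts as orphaned once no inlink has been seen
for longer than the configured time.

```java
import java.util.Properties;

public class OrphanCheckSketch {

    // Hypothetical property name, used only for this illustration.
    public static final String ORPHAN_AFTER_KEY = "example.orphan.after.seconds";

    // Assumed default of 30 days; a sensible value depends on the crawl cycle.
    public static final long DEFAULT_ORPHAN_AFTER_SEC = 30L * 24 * 60 * 60;

    // Reads the configured time, falling back to the assumed default.
    public static long orphanAfterSeconds(Properties conf) {
        String value = conf.getProperty(ORPHAN_AFTER_KEY);
        return value == null ? DEFAULT_ORPHAN_AFTER_SEC : Long.parseLong(value.trim());
    }

    // A page is orphaned when the last inlink was seen longer ago than the limit.
    public static boolean isOrphaned(long lastInlinkSeenSec, long nowSec, long orphanAfterSec) {
        return nowSec - lastInlinkSeenSec > orphanAfterSec;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(ORPHAN_AFTER_KEY, "600"); // 10 minutes, as in the test setup
        long limit = orphanAfterSeconds(conf);
        long now = System.currentTimeMillis() / 1000L;

        System.out.println(isOrphaned(now - 601, now, limit)); // inlink last seen 601 s ago
        System.out.println(isOrphaned(now - 60, now, limit));  // inlink last seen 60 s ago
    }
}
```

In the scoring filter, a page for which such a check returns true would then get
the status "db_gone", so that "db.update.purge.404" and -deleteGone take care of
the removal.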
> Automatically remove orphaned pages
> -----------------------------------
>
> Key: NUTCH-1932
> URL: https://issues.apache.org/jira/browse/NUTCH-1932
> Project: Nutch
> Issue Type: New Feature
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Attachments: NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch,
> NUTCH-1932.patch, NUTCH-1932.patch
>
>
> Nutch should be able to automatically remove orphaned pages such as old
> 404's, and not continue to revisit them. This requires NUTCH-1913. An inlink
> count of 1 is enough.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)