[
https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14312479#comment-14312479
]
Sebastian Nagel commented on NUTCH-1932:
----------------------------------------
Hi [~markus17], can you explain how the removal of orphaned pages is done?
# "orphaned" means a 404 without any (internal) links pointing to it, right?
# to exclude dead external links, NUTCH-1913 is required. In this case, is a
second linkdb used which contains exclusively internal links?
# how are dead links removed from the linkdb? Normally, the linkdb is built
incrementally over time and filled segment by segment. There is no mechanism to
remove links which have since been removed from a page. Unlike in WebGraph,
outgoing edges are not stored, which makes it hard to remove dead links from the
linkdb. Of course, if a second "internal" linkdb is used, it could be filled
with recent segments only.
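
To illustrate the concern in the last point, here is a minimal sketch in plain Python. It does not use any Nutch API: the functions build_inlinks and find_orphans, the (source, target) pair representation of a segment, and the example URLs are all hypothetical. It also assumes the reading of "orphaned" from the first point (a gone page with no remaining internal inlinks), which is still an open question here.

```python
from collections import defaultdict


def build_inlinks(segments):
    """Build a target -> {sources} map from segments.

    Each segment is modelled as a list of (source_url, target_url) pairs.
    Links are only ever added, never removed, which mirrors an
    incrementally filled link database.
    """
    inlinks = defaultdict(set)
    for segment in segments:
        for source, target in segment:
            inlinks[target].add(source)
    return inlinks


def find_orphans(gone_urls, inlinks):
    """Return the gone (e.g. 404) URLs that have no recorded inlinks."""
    return sorted(url for url in gone_urls if not inlinks.get(url))


# Hypothetical crawl history: the home page used to link to /old,
# and a later fetch of the same page links to /new instead.
old_segment = [("http://example.com/", "http://example.com/old")]
new_segment = [("http://example.com/", "http://example.com/new")]
gone = {"http://example.com/old"}

# Incremental link map over all segments: the stale link is still present,
# so /old does not look orphaned.
incremental = find_orphans(gone, build_inlinks([old_segment, new_segment]))

# Link map filled from recent segments only: the stale link is absent,
# so /old is detected as orphaned.
recent_only = find_orphans(gone, build_inlinks([new_segment]))
```

In this toy model the incrementally built map keeps the dead link to /old forever, while the map built from recent segments only lets the page be recognised as orphaned.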
> Automatically remove orphaned pages
> -----------------------------------
>
> Key: NUTCH-1932
> URL: https://issues.apache.org/jira/browse/NUTCH-1932
> Project: Nutch
> Issue Type: New Feature
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.11
>
> Attachments: NUTCH-1932.patch
>
>
> Nutch should be able to automatically remove orphaned pages such as old
> 404's, and not continue to revisit them. This requires NUTCH-1913. An inlink
> count of 1 is enough.