[
https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14312497#comment-14312497
]
Markus Jelsma edited comment on NUTCH-1932 at 2/9/15 5:27 PM:
--------------------------------------------------------------
Hi Sebastian,
# it does not need to be a 404; orphan means the page has no incoming hyperlinks;
# you need a second linkdb if you already use one for indexing anchors, or
that linkdb must also include internal hyperlinks, which is certainly not
recommended;
# indeed, you would need to regenerate the linkdb from scratch each cycle; we
do it once a month.
When you rebuild the linkdb from scratch every time, you also need
NUTCH-1921. Hyperlinks are not recorded as outlinks for fetch_notmodified
pages, so the linkdb cannot be rebuilt correctly and pages are then
incorrectly marked as orphans.
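To make the point about fetch_notmodified concrete, here is a minimal
sketch in Python. It is not Nutch code: the function names, the example URLs
and the idea of exempting seed URLs are all assumptions made for illustration.
It counts inlinks from whatever outlinks were recorded and reports every known
URL with a count of zero as an orphan.

```python
from collections import defaultdict


def build_inlink_counts(outlinks_by_page):
    """Count incoming hyperlinks per target URL from the recorded outlinks."""
    counts = defaultdict(int)
    for targets in outlinks_by_page.values():
        # Count each source page at most once per target.
        for target in set(targets):
            counts[target] += 1
    return counts


def find_orphans(known_urls, outlinks_by_page, seeds=()):
    """Return known URLs with no inlinks (seeds are exempt, by assumption)."""
    counts = build_inlink_counts(outlinks_by_page)
    return sorted(
        url for url in known_urls
        if counts.get(url, 0) == 0 and url not in seeds
    )


# Hypothetical three-page site: /a links to /b and /c, /b links to /c.
known = {"http://example.com/a", "http://example.com/b", "http://example.com/c"}
seeds = {"http://example.com/a"}

# All outlinks recorded: nothing is orphaned.
full = {
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/c"],
}

# /a came back as not modified, so its outlinks were not recorded.
partial = {
    "http://example.com/b": ["http://example.com/c"],
}

orphans_full = find_orphans(known, full, seeds)
orphans_partial = find_orphans(known, partial, seeds)
```

With the complete outlink data no page is reported. Once the outlinks of the
unmodified page /a are missing, /b loses its only inlink and is wrongly
reported as an orphan, which is the failure described above.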
> Automatically remove orphaned pages
> -----------------------------------
>
> Key: NUTCH-1932
> URL: https://issues.apache.org/jira/browse/NUTCH-1932
> Project: Nutch
> Issue Type: New Feature
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.11
>
> Attachments: NUTCH-1932.patch
>
>
> Nutch should be able to automatically remove orphaned pages such as old
> 404s, and not continue to revisit them. This requires NUTCH-1913. An inlink
> count of 1 is enough.