[ 
https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14312497#comment-14312497
 ] 

Markus Jelsma edited comment on NUTCH-1932 at 2/9/15 5:27 PM:
--------------------------------------------------------------

Hi Sebastian,

# it does not need to be a 404, orphan means no incoming hyperlinks;
# you need a second linkdb if you already use one for indexing anchors, or 
that linkdb must also include internal hyperlinks, which is certainly not 
recommended;
# indeed, you would need to regenerate the linkdb from scratch each cycle; we 
do it once a month.

When you rebuild the linkdb from scratch every time, you also need 
NUTCH-1921. Hyperlinks are not recorded as outlinks for fetch_notmodified 
pages, so the rebuilt linkdb is incomplete and pages that still have 
incoming hyperlinks are incorrectly marked as orphans.
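
To illustrate that last point, here is a toy sketch in Python. It is not 
Nutch code: build_linkdb and find_orphans are made-up names, and the plain 
dictionaries are only stand-ins for segments and the linkdb. It assumes the 
linkdb includes internal hyperlinks and that a not-modified page contributes 
no outlinks at all.

```python
from collections import defaultdict


def build_linkdb(segments):
    """Invert outlinks into inlinks (hypothetical stand-in for a linkdb).

    `segments` is a list of dicts mapping a fetched URL to the outlinks
    that were recorded for it in that segment.
    """
    inlinks = defaultdict(set)
    for segment in segments:
        for source, outlinks in segment.items():
            for target in outlinks:
                if target != source:  # ignore self-links
                    inlinks[target].add(source)
    return inlinks


def find_orphans(known_urls, inlinks):
    """A URL counts as an orphan when no other page links to it."""
    return {url for url in known_urls if not inlinks.get(url)}


home = "http://example.org/"
page_a = "http://example.org/a"
old = "http://example.org/old"
known = {home, page_a, old}

# Cycle where both pages were fully fetched: all outlinks are recorded.
fetched = [{home: [page_a], page_a: [home]}]

# Cycle where the home page came back as not modified: its outlinks are
# missing (assumption of this sketch), so its link to page_a is lost.
not_modified = [{home: [], page_a: [home]}]

print(sorted(find_orphans(known, build_linkdb(fetched))))
# ['http://example.org/old']
print(sorted(find_orphans(known, build_linkdb(not_modified))))
# ['http://example.org/a', 'http://example.org/old']
```

In the second run /a is reported as an orphan although the home page still 
links to it, only because the home page was not modified in that cycle.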


> Automatically remove orphaned pages
> -----------------------------------
>
>                 Key: NUTCH-1932
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1932
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.11
>
>         Attachments: NUTCH-1932.patch
>
>
> Nutch should be able to automatically remove orphaned pages such as old 
> 404s, and not continue to revisit them. This requires NUTCH-1913. An inlink 
> count of 1 is enough.



