[ 
https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313112#comment-14313112
 ] 

Sebastian Nagel commented on NUTCH-1932:
----------------------------------------

ok, understood. "we do it once a month" -- this would roughly mean: remove (set 
to status "gone") all pages that have been "orphaned" for one month, right? 
Removal of orphaned pages is a valid use case for sure. I've seen it as a 
requirement, but didn't think about a solution -- I just crawled from scratch 
every day (can be done for a site crawl). 
However, I wonder whether the problem could be solved with less effort: without 
an extra linkdb, and without touching CrawlDbReducer's reduce function (it's 
already too complex). Wouldn't it be possible to do (part of) the work via a 
scoring plugin:
* in updateDbScore(): touch a time stamp in CrawlDatum's meta data if there is 
an (internal) link
* detection of orphaned pages then means only checking this time stamp: if it's 
some configurable time in the past (e.g., one month), assume that the page is 
orphaned.
* deletion (set status to "gone") is currently not possible within a scoring 
filter. But we could think about adding a hook to the scoring filter which is 
called only for items that are currently not passed to scoring filters.
* ok, we need to handle not-modified pages and redirects (but that's also the 
case for the linkdb solution)
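
To make the first two points concrete, here is a minimal sketch of the 
time-stamp logic. It deliberately does not use the Nutch API: the meta data is 
modelled as a plain Map, and the class name, the key "_lastInlinkTime_" and both 
method names are made up for illustration. Treating a page that was never 
stamped as "not orphaned" is also just an assumption, so that seeds are not 
dropped by accident.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

// Sketch only: stands in for what a scoring plugin might do in updateDbScore().
class OrphanSketch {
    // Hypothetical meta data key, not an existing Nutch constant.
    static final String LAST_INLINK_KEY = "_lastInlinkTime_";

    // Refresh the time stamp if at least one inlink was seen in this update.
    static void touchIfLinked(Map<String, Long> meta, int inlinkCount, long now) {
        if (inlinkCount > 0) {
            meta.put(LAST_INLINK_KEY, now);
        }
    }

    // A page counts as orphaned if its last inlink is older than maxAgeMillis.
    static boolean isOrphaned(Map<String, Long> meta, long now, long maxAgeMillis) {
        Long last = meta.get(LAST_INLINK_KEY);
        if (last == null) {
            return false; // assumption: never stamped means "unknown", keep the page
        }
        return now - last > maxAgeMillis;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long oneMonth = TimeUnit.DAYS.toMillis(30); // the configurable threshold

        Map<String, Long> stale = new HashMap<>();
        touchIfLinked(stale, 1, now - TimeUnit.DAYS.toMillis(45));

        Map<String, Long> fresh = new HashMap<>();
        touchIfLinked(fresh, 3, now - TimeUnit.DAYS.toMillis(1));

        System.out.println("stale orphaned: " + isOrphaned(stale, now, oneMonth));
        System.out.println("fresh orphaned: " + isOrphaned(fresh, now, oneMonth));
    }
}
```

Setting the status to "gone" is left out on purpose, since that is the part 
which would need the additional hook.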

Just a rough idea, I surely missed some gory details. :)

> Automatically remove orphaned pages
> -----------------------------------
>
>                 Key: NUTCH-1932
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1932
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.11
>
>         Attachments: NUTCH-1932.patch
>
>
> Nutch should be able to automatically remove orphaned pages such as old 
> 404's, and not continue to revisit them. This requires NUTCH-1913. An inlink 
> count of 1 is enough.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
