[
https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313112#comment-14313112
]
Sebastian Nagel commented on NUTCH-1932:
----------------------------------------
ok, understood. "we do it once a month" -- this would roughly mean: remove (set
to status "gone") all pages that have been "orphaned" for a month, right?
Removal of orphaned pages is a valid use case for sure. I've seen it as a
requirement, but didn't think about a solution -- just crawled from scratch
every day (which can be done for a site crawl).
However, I wonder whether the problem could be solved with less effort: no
extra linkdb, and no changes to CrawlDbReducer's reduce function (it's
already too complex). Wouldn't it be possible to do (part of) the work via a
scoring plugin:
* in updateDbScore(): touch a time stamp in the CrawlDatum's metadata if there
is an (internal) link
* detection of orphaned pages then means only checking this time stamp: if it's
some configurable time in the past (e.g., one month), assume that the page is
orphaned.
* deletion (setting the status to "gone") is currently not possible within a
scoring filter. But we could think about adding a hook to the scoring filter
API which is called only for items that are currently not passed to scoring
filters.
* ok, we also need to handle not-modified pages and redirects (but that's the
case for the linkdb solution too)
Just a rough idea, I surely missed some gory details. :)
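To make the timestamp idea concrete, here is a minimal self-contained sketch of the check it implies. The class and metadata key names (OrphanCheck, "_orphan_ts_") are hypothetical, not Nutch API; in a real plugin the touch would happen inside the scoring filter's updateDbScore() against the CrawlDatum's metadata, and the age check would run during the CrawlDb update.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of the proposal above: whenever a page receives an (internal)
 * inlink, "touch" a timestamp in its metadata; a page whose timestamp is
 * older than a configurable interval (e.g. one month) is assumed orphaned.
 */
public class OrphanCheck {
  // Hypothetical metadata key, not a real Nutch constant.
  static final String TS_KEY = "_orphan_ts_";

  /** Called when the page has at least one (internal) inlink this round. */
  public static void touch(Map<String, String> meta, long nowMillis) {
    meta.put(TS_KEY, Long.toString(nowMillis));
  }

  /** True if the last-inlink timestamp is missing or older than maxAgeMillis. */
  public static boolean isOrphaned(Map<String, String> meta,
                                   long maxAgeMillis, long nowMillis) {
    String ts = meta.get(TS_KEY);
    if (ts == null) {
      return true; // never saw an inlink at all
    }
    return nowMillis - Long.parseLong(ts) > maxAgeMillis;
  }
}
```

A deletion hook, as suggested above, would then only have to call isOrphaned() and flip the status to "gone" when it returns true.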
> Automatically remove orphaned pages
> -----------------------------------
>
> Key: NUTCH-1932
> URL: https://issues.apache.org/jira/browse/NUTCH-1932
> Project: Nutch
> Issue Type: New Feature
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.11
>
> Attachments: NUTCH-1932.patch
>
>
> Nutch should be able to automatically remove orphaned pages such as old
> 404's, and not continue to revisit them. This requires NUTCH-1913. An inlink
> count of 1 is enough.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)