[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14746034#comment-14746034 ]
Markus Jelsma commented on NUTCH-1932:
--------------------------------------
Hello Sebastian. I am not sure about that being on the list. Duplicates are not
removed from the CrawlDb, at least not in CrawlDbReducer, nor in CrawlDbFilter.
The deduplicator marks them as duplicate (equal signatures so far) so they can
be removed from the index by the CleaningJob.
We could, however, move the orphan removal code that is now in the patch
to CrawlDbFilter, next to where 404s are purged. That would at least make the
reducer slightly smaller, which in our case is probably a good thing. I am not
in favor of marking them as gone and relying on 404 purging, because regular
404s will be found again and cost resources at large scale if you use adaptive
scheduling: they are first refetched after a small interval, and the interval
then increases.
> Automatically remove orphaned pages
> -----------------------------------
>
> Key: NUTCH-1932
> URL: https://issues.apache.org/jira/browse/NUTCH-1932
> Project: Nutch
> Issue Type: New Feature
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Attachments: NUTCH-1932-add.patch, NUTCH-1932.patch,
> NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch,
> NUTCH-1932.patch, NUTCH-1932.patch
>
>
> Orphan scoring filter that determines whether a page has become orphaned,
> i.e. no other pages link to it anymore. If a page hasn't been linked
> to after markGoneAfter seconds, the page is marked as gone and is then
> removed by an indexer. If a page hasn't been linked to after markOrphanAfter
> seconds, the page is removed from the CrawlDB.
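
The two thresholds in the description above can be sketched as a small
decision function. This is only an illustration, not the code from the
attached patches: the class, enum and method names are hypothetical, and it
assumes markGoneAfter is not larger than markOrphanAfter and that the time of
the last observed inlink is available in epoch seconds.

```java
// Hypothetical sketch of the orphan thresholds; not taken from the patch.
final class OrphanCheck {

    // What to do with a page, based on how long it has gone without inlinks.
    enum Action { KEEP, MARK_GONE, REMOVE }

    // nowSec and lastInlinkSec are epoch seconds; the two thresholds are
    // durations in seconds. Assumes markGoneAfter <= markOrphanAfter.
    static Action decide(long nowSec, long lastInlinkSec,
                         long markGoneAfter, long markOrphanAfter) {
        long secondsWithoutInlink = nowSec - lastInlinkSec;
        if (secondsWithoutInlink > markOrphanAfter) {
            // Orphaned for long enough: drop the record from the CrawlDb.
            return Action.REMOVE;
        }
        if (secondsWithoutInlink > markGoneAfter) {
            // Mark as gone so the indexer can delete the document.
            return Action.MARK_GONE;
        }
        return Action.KEEP;
    }
}
```

Checking the longer threshold first means a page that exceeds both limits is
removed rather than merely marked as gone.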