[ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14963014#comment-14963014 ]
Markus Jelsma commented on NUTCH-1932:
--------------------------------------
We have had it running in production for over 30 days. Using the awesome JEXL
expression feature ;) I dumped records with a high orphan time:
{code}
bin/nutch readdb crawl/crawldb -dump output -format csv -expr "(_orphan_ > 0 && `date -u +%s` > (_orphan_ + (30 * 86400)))"
{code}
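As a rough illustration of what the expression above selects, here is a minimal Python sketch of the same predicate. The function name `is_stale_orphan` and its parameters are made up for this example and are not part of Nutch. `orphan_time` stands in for the `_orphan_` value, which is assumed to be in epoch seconds, as the comparison against `date -u +%s` suggests.

```python
import time

SECONDS_PER_DAY = 86400


def is_stale_orphan(orphan_time, now=None, days=30):
    """Hypothetical mirror of the JEXL filter:
    _orphan_ > 0 && now > _orphan_ + days * 86400
    """
    if now is None:
        # Same role as `date -u +%s` in the shell command.
        now = int(time.time())
    # Records without an orphan time (0) are never selected.
    return orphan_time > 0 and now > orphan_time + days * SECONDS_PER_DAY
```

A record is selected only when its orphan time is set and lies more than `days` days in the past, so every dumped record is one that should already have been marked.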
But none of these records were marked gone or orphan. This is because the
reducer doesn't call the scoring filters for unmodified URLs. This is bad
news: it means a scoring filter cannot perform work on ALL URLs regardless of
their state. I'll attach another patch that always passes a record through
the scoring filter, which makes sense to me.
> Automatically remove orphaned pages
> -----------------------------------
>
> Key: NUTCH-1932
> URL: https://issues.apache.org/jira/browse/NUTCH-1932
> Project: Nutch
> Issue Type: New Feature
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Attachments: NUTCH-1932-add.patch, NUTCH-1932.patch,
> NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch,
> NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch,
> NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch
>
>
> Orphan scoring filter that determines whether a page has become orphaned,
> i.e. no other pages link to it any more. If a page hasn't been linked to
> for markGoneAfter seconds, the page is marked as gone and is then removed
> by an indexer. If a page hasn't been linked to for markOrphanAfter seconds,
> the page is removed from the CrawlDB.
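The two thresholds in the description can be sketched as a small decision function. This is only an illustration in Python, not the filter's actual code: the function name, the string labels, the strict comparisons, and the assumption that markOrphanAfter is at least as large as markGoneAfter are all mine.

```python
def classify_page(last_linked, now, mark_gone_after, mark_orphan_after):
    """Hypothetical sketch of the orphan decision.

    last_linked: epoch seconds when an inlink to the page was last seen.
    Assumes mark_orphan_after >= mark_gone_after, so the orphan check
    is evaluated first.
    """
    elapsed = now - last_linked
    if elapsed > mark_orphan_after:
        return "orphan"     # would be removed from the CrawlDB
    if elapsed > mark_gone_after:
        return "gone"       # would be removed by an indexer
    return "unchanged"      # linked recently enough, nothing to do
```

Note that this decision can only take effect if the filter is actually invoked for the record, which is why skipping unmodified URLs in the reducer leaves old orphans unmarked.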