[ 
https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15964369#comment-15964369
 ] 

Sebastian Nagel commented on NUTCH-1932:
----------------------------------------

Hi [[email protected]],
after a look at the latest patch (June 30 2016): updateDbScore(...) is now 
called in CrawlDbReducer with swapped arguments (old <> datum) if no fetch has 
taken place and there are no inlinks. Are you sure that all scoring-filter 
plugins behave well in this situation? They should do nothing, exactly as 
before, when they were not called at all. But that is not the case: 
scoring-depth, at least, overwrites the tracked depth in this situation. 
Wouldn't it be clearer to add a new method to the ScoringFilter interface, so 
that existing plugins (including custom ones) are not accidentally broken?
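
To illustrate the idea of a separate method, here is a minimal, self-contained 
sketch. It does not use the real Nutch classes: Datum, Filter, DepthFilter, 
OrphanFilter and reduce(...) are simplified stand-ins, and the method name 
orphanedScore(...) is only an assumption made for this example. The point is 
that a default no-op method leaves existing plugins untouched when there was 
no fetch and there are no inlinks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OrphanHookSketch {

  /** Simplified stand-in for a CrawlDb entry (not the real CrawlDatum). */
  static class Datum {
    int depth;
    boolean orphanChecked;
    Datum(int depth) { this.depth = depth; }
  }

  /** Simplified stand-in for the ScoringFilter interface. */
  interface Filter {
    void updateDbScore(String url, Datum old, Datum datum, List<Datum> inlinked);

    /** Hypothetical new hook; the default is a no-op, so existing plugins are unaffected. */
    default void orphanedScore(String url, Datum datum) { }
  }

  /** Toy depth plugin: always writes a depth, so any extra call overwrites it. */
  static class DepthFilter implements Filter {
    static final int DEFAULT_DEPTH = 1000;

    @Override
    public void updateDbScore(String url, Datum old, Datum datum, List<Datum> inlinked) {
      int depth = DEFAULT_DEPTH;
      for (Datum in : inlinked) {
        depth = Math.min(depth, in.depth + 1);
      }
      datum.depth = depth; // with no inlinks this resets the tracked depth
    }
  }

  /** Toy orphan plugin: only this one implements the new hook. */
  static class OrphanFilter implements Filter {
    @Override
    public void updateDbScore(String url, Datum old, Datum datum, List<Datum> inlinked) { }

    @Override
    public void orphanedScore(String url, Datum datum) {
      datum.orphanChecked = true;
    }
  }

  /** Stand-in for the reducer step: picks the hook based on fetch/inlink state. */
  static void reduce(List<Filter> filters, String url, Datum old, Datum fetched,
      List<Datum> inlinked) {
    boolean untouched = fetched == null && inlinked.isEmpty();
    for (Filter f : filters) {
      if (untouched) {
        f.orphanedScore(url, old); // new hook only, updateDbScore is not called
      } else {
        Datum result = (fetched != null) ? fetched : old;
        f.updateDbScore(url, old, result, inlinked);
      }
    }
  }

  public static void main(String[] args) {
    Datum old = new Datum(3);
    List<Filter> filters = Arrays.asList(new DepthFilter(), new OrphanFilter());
    reduce(filters, "http://example.org/", old, null, new ArrayList<>());
    System.out.println("depth=" + old.depth + " orphanChecked=" + old.orphanChecked);
  }
}
```

In this sketch the depth plugin is never invoked for the unfetched, unlinked 
entry, so its tracked depth survives, while the orphan plugin still gets its 
callback.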

> Automatically remove orphaned pages
> -----------------------------------
>
>                 Key: NUTCH-1932
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1932
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>         Attachments: NUTCH-1932-add.patch, NUTCH-1932.patch, 
> NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, 
> NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, 
> NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, 
> NUTCH-1932.patch
>
>
> Orphan scoring filter that determines whether a page has become orphaned, 
> i.e. no other pages link to it any more. If a page hasn't been linked to 
> after markGoneAfter seconds, the page is marked as gone and is then removed 
> by an indexer. If a page hasn't been linked to after markOrphanAfter 
> seconds, the page is removed from the CrawlDB.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
