[ 
https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126215#comment-16126215
 ] 

ASF GitHub Bot commented on NUTCH-1932:
---------------------------------------

lewismc commented on a change in pull request #211: NUTCH-1932 Automatically remove orphaned pages
URL: https://github.com/apache/nutch/pull/211#discussion_r133030580
 
 

 ##########
 File path: src/java/org/apache/nutch/scoring/ScoringFilter.java
 ##########
 @@ -179,6 +179,20 @@ public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum,
       List<CrawlDatum> inlinked) throws ScoringFilterException;
 
   /**
+   * This method may change the score or status of CrawlDatum during CrawlDb
+   * update, when the URL is neither fetched nor has any inlinks.
+   *
+   * @param url
+   *          URL of the page
+   * @param datum
+   *          CrawlDatum for page
+   * @throws ScoringFilterException
 
 Review comment:
   I think this 'may' break Javadoc generation if no description is provided 
alongside the exception in the @throws tag.
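
   To illustrate the point, here is a minimal sketch of a throws tag that carries a description. The class JavadocThrowsExample, the method checkUrl, and the nested ScoringFilterException are hypothetical stand-ins for illustration only, not the code from the pull request; the description text is likewise an assumption. Javadoc's doclint checks can flag a throws tag with no description, so giving each tag a short clause is the safer form.

```java
public class JavadocThrowsExample {

  /** Hypothetical stand-in for Nutch's ScoringFilterException. */
  static class ScoringFilterException extends Exception {
    ScoringFilterException(String message) {
      super(message);
    }
  }

  /**
   * Hypothetical method whose throws tag includes a description.
   *
   * @param url
   *          URL of the page
   * @return the URL, unchanged, when it is non-empty
   * @throws ScoringFilterException
   *           if the URL is null or empty (assumed wording, for illustration)
   */
  static String checkUrl(String url) throws ScoringFilterException {
    if (url == null || url.isEmpty()) {
      throw new ScoringFilterException("empty URL");
    }
    return url;
  }
}
```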
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Automatically remove orphaned pages
> -----------------------------------
>
>                 Key: NUTCH-1932
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1932
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>         Attachments: NUTCH-1932-add.patch, NUTCH-1932.patch, 
> NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, 
> NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, 
> NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, 
> NUTCH-1932.patch
>
>
> Orphan scoring filter that determines whether a page has become orphaned, 
> i.e. no other pages link to it anymore. If a page hasn't been linked 
> to after markGoneAfter seconds, the page is marked as gone and is then 
> removed by an indexer. If a page hasn't been linked to after markOrphanAfter 
> seconds, the page is removed from the CrawlDB.
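
The two thresholds in the description above can be sketched roughly as follows. This is not the plugin's actual code: the class OrphanSketch, the enum Status, the method classify, and its time parameters are hypothetical names chosen for illustration. The sketch also assumes that markOrphanAfter is at least as large as markGoneAfter and that "after N seconds" means strictly more than N seconds since the last inlink was seen.

```java
public class OrphanSketch {

  /** Hypothetical outcome labels; the real filter works on CrawlDatum status. */
  enum Status { LINKED, GONE, ORPHAN }

  /**
   * Classifies a page by the time elapsed since it was last linked to.
   * All arguments are in seconds. Assumes markOrphanAfter >= markGoneAfter.
   */
  static Status classify(long lastInlinkTime, long now,
      long markGoneAfter, long markOrphanAfter) {
    long elapsed = now - lastInlinkTime;
    if (elapsed > markOrphanAfter) {
      // Past the larger threshold: candidate for removal from the CrawlDB.
      return Status.ORPHAN;
    }
    if (elapsed > markGoneAfter) {
      // Past the smaller threshold: marked as gone so an indexer can remove it.
      return Status.GONE;
    }
    return Status.LINKED;
  }
}
```

Checking the larger threshold first keeps the two outcomes mutually exclusive under the stated assumption.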



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
