[ 
https://issues.apache.org/jira/browse/NUTCH-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125931#comment-13125931
 ] 

Andrzej Bialecki  commented on NUTCH-1142:
------------------------------------------

+1, the patch looks good.

(There is one philosophical :) aspect of this change, as with any situation 
where you calculate PageRank in the presence of URL filtering: does it matter 
that a page was linked to from other pages that you decided to filter out? 
I.e., in PageRank the relative page importance is a function of in-degree, and 
by filtering out incoming links you change the in-degree. This essentially 
means that you decide to ignore some evidence that a page may be more 
important, namely links from pages that may not be interesting to you but 
which still exist. OTOH the incoming links may have been spam, so one would 
expect that in the grand picture it evens out.)
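
To make the in-degree point concrete, here is a minimal sketch that runs a 
plain power-iteration PageRank on a toy graph before and after filtering. It 
is not Nutch's LinkRank code; the function names, the example URLs, the 
damping factor, and the "drop any edge touching a filtered URL" rule are all 
assumptions made for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by pages without outlinks is spread evenly over all pages.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {n: base for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


def filter_edges(edges, keep):
    """Drop every edge whose source or target is rejected by keep()."""
    return [(s, d) for s, d in edges if keep(s) and keep(d)]


def in_degree(edges, url):
    """Count the edges that point at url."""
    return sum(1 for _, dst in edges if dst == url)


# Hypothetical toy graph: two of the three links into a.example/3 come from
# hosts that the (assumed) URL filter rejects.
edges = [
    ("a.example/1", "a.example/2"),
    ("a.example/2", "a.example/1"),
    ("a.example/1", "a.example/3"),
    ("spam.example/x", "a.example/3"),
    ("spam.example/y", "a.example/3"),
]
kept = filter_edges(edges, lambda url: not url.startswith("spam.example"))

full = pagerank(edges)
filtered = pagerank(kept)
for url in sorted(filtered):
    print(url, round(full[url], 3), round(filtered[url], 3))
```

In this toy graph the in-degree of a.example/3 drops from 3 to 1 once the 
spam.example links are filtered, so its score is computed from less evidence. 
Whether that is a loss or a gain depends on whether the dropped links were 
genuine or spam.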
                
> Normalization and filtering in WebGraph
> ---------------------------------------
>
>                 Key: NUTCH-1142
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1142
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.5
>
>         Attachments: NUTCH-1142-1.4.patch, NUTCH-1142-1.5-2.patch, 
> NUTCH-1142-1.5-3.patch
>
>
> The WebGraph program performs URL normalization. Since normalization of 
> outlinks is already performed during parsing, it should become optional. 
> There is also no URL filtering mechanism in the WebGraph program. When a 
> CrawlDatum is removed from the CrawlDB by a URL filter, it should be possible 
> to remove it from the web graph as well.
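
The behaviour requested in the description above could be sketched roughly as 
follows: normalization and filtering become independent switches applied while 
edges are collected. This does not reflect the attached patches or Nutch's 
actual classes; `normalize`, `accept`, `build_edges`, the two flags, and the 
blocked host are hypothetical names chosen for the example.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Hypothetical normalizer: lowercase scheme and host, drop the fragment."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def accept(url):
    """Hypothetical filter: reject anything on a blocked host."""
    return urlsplit(url).hostname not in {"blocked.example"}


def build_edges(outlinks, do_normalize=False, do_filter=False):
    """Collect (source, target) edges, normalizing and filtering only on request."""
    edges = []
    for src, dst in outlinks:
        if do_normalize:
            src, dst = normalize(src), normalize(dst)
        # An edge is dropped if either end is rejected by the filter.
        if do_filter and not (accept(src) and accept(dst)):
            continue
        edges.append((src, dst))
    return edges


outlinks = [("HTTP://A.example/x#frag", "http://blocked.example/")]
print(build_edges(outlinks))                      # untouched
print(build_edges(outlinks, do_normalize=True))   # normalized only
print(build_edges(outlinks, do_filter=True))      # filtered only
```

Keeping both switches off by default matches the observation that outlinks 
were already normalized during parsing, so repeating the work is only useful 
when the rules have changed since then.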

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
