[ 
https://issues.apache.org/jira/browse/NUTCH-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1142:
---------------------------------

    Description: 
The WebGraph programs perform URL normalization. Since normalization of 
outlinks is already performed during parsing, it should become optional. There 
is also no URL filtering mechanism in the WebGraph programs. When a CrawlDatum 
is removed from the CrawlDB by a URL filter, it should be possible to remove it 
from the web graph as well.


  was: The WebGraph programs perform URL normalization. Since normalization of 
outlinks is already performed during parsing, it should become optional.

     Patch Info: Patch Available
        Summary: Normalization and filtering in WebGraph  (was: Normalization 
optional in WebGraph)
    
> Normalization and filtering in WebGraph
> ---------------------------------------
>
>                 Key: NUTCH-1142
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1142
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.5
>
>
> The WebGraph programs perform URL normalization. Since normalization of 
> outlinks is already performed during parsing, it should become optional. 
> There is also no URL filtering mechanism in the WebGraph programs. When a 
> CrawlDatum is removed from the CrawlDB by a URL filter, it should be possible 
> to remove it from the web graph as well.


        
