[ https://issues.apache.org/jira/browse/NUTCH-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-1142:
---------------------------------
Description:
The WebGraph program performs URL normalization. Since outlinks are already
normalized during parsing, this step should become optional. The web graph
program also has no URL filtering mechanism. When a CrawlDatum is removed
from the CrawlDB by a URL filter, it should be possible to remove it from
the web graph as well.
was: The WebGraph program performs URL normalization. Since outlinks are
already normalized during parsing, this step should become optional.
Patch Info: Patch Available
Summary: Normalization and filtering in WebGraph (was: Normalization
optional in WebGraph)
> Normalization and filtering in WebGraph
> ---------------------------------------
>
> Key: NUTCH-1142
> URL: https://issues.apache.org/jira/browse/NUTCH-1142
> Project: Nutch
> Issue Type: Improvement
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.5
>
>
> The WebGraph program performs URL normalization. Since outlinks are already
> normalized during parsing, this step should become optional. The web graph
> program also has no URL filtering mechanism. When a CrawlDatum is removed
> from the CrawlDB by a URL filter, it should be possible to remove it from
> the web graph as well.
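
The behaviour requested above could look roughly like the following. This is
only a sketch under stated assumptions, not the attached patch: the class name
WebGraphUrlSketch, the method process, and both boolean switches are
hypothetical, and the normalizer and filter are plain Java functions standing
in for whatever Nutch plugins are configured. A null return means the URL is
dropped from the graph.

```java
import java.util.Locale;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Hypothetical sketch; none of these names come from the Nutch code base.
final class WebGraphUrlSketch {

    /**
     * Optionally normalizes, then optionally filters, a single URL.
     * Returns null when the URL should not enter the web graph.
     */
    static String process(String url,
                          boolean normalize,
                          boolean filter,
                          UnaryOperator<String> normalizer,
                          Predicate<String> accept) {
        if (url == null) {
            return null;
        }
        String result = url;
        if (normalize) {
            // Skipped when outlinks were already normalized at parse time.
            result = normalizer.apply(result);
        }
        if (result != null && filter && !accept.test(result)) {
            // Rejected by the filter, so it is removed from the graph too.
            return null;
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy stand-ins: lower-casing the whole URL is for illustration only.
        UnaryOperator<String> lower = u -> u.toLowerCase(Locale.ROOT);
        Predicate<String> notBlocked = u -> !u.contains("spam.example");

        System.out.println(process("HTTP://Example.COM/", true, true, lower, notBlocked));
        System.out.println(process("http://spam.example/x", false, true, lower, notBlocked));
        System.out.println(process("HTTP://Example.COM/", false, false, lower, notBlocked));
    }
}
```

With both switches off the URL passes through untouched, which corresponds to
relying on the normalization already done during the parse.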