[ https://issues.apache.org/jira/browse/NUTCH-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125940#comment-13125940 ]
Markus Jelsma commented on NUTCH-1142:
--------------------------------------
You are right, of course, although the segments we feed are usually already
filtered and normalized by ParseOutputFormat. The same is true for the
invertlinks program, which is analogous to parts of the webgraph program.
I prefer a webgraph that represents the contents of a crawldb ;)
Ah well, it's optional. Thanks for sharing your thoughts, Andrzej.
> Normalization and filtering in WebGraph
> ---------------------------------------
>
> Key: NUTCH-1142
> URL: https://issues.apache.org/jira/browse/NUTCH-1142
> Project: Nutch
> Issue Type: Improvement
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.5
>
> Attachments: NUTCH-1142-1.4.patch, NUTCH-1142-1.5-2.patch,
> NUTCH-1142-1.5-3.patch
>
>
> The WebGraph program performs URL normalization. Since normalization of
> outlinks is already performed during parsing, it should become optional.
> There is also no URL filtering mechanism in the WebGraph program. When a
> CrawlDatum is removed from the CrawlDB by a URL filter, it should be possible
> to remove it from the web graph as well.
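The behaviour requested in the description can be illustrated with a small,
self-contained Java sketch. It is not the code from the attached patches and
does not use the Nutch API: the class WebGraphUrlSketch, the clean method, the
two boolean switches and the toy stripFragment and rejectPrivate helpers are
all hypothetical names chosen for illustration. It assumes that normalization
runs before filtering and that a filter signals rejection by returning null.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch: optional normalization and filtering of link URLs.
final class WebGraphUrlSketch {

    // Toy normalizer (assumption): drops the fragment part of a URL.
    static String stripFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    // Toy filter (assumption): rejects URLs under /private/ by returning null.
    static String rejectPrivate(String url) {
        return url.contains("/private/") ? null : url;
    }

    // Returns the cleaned URL, or null if the URL should be dropped.
    static String clean(String url, boolean normalize, boolean filter,
                        UnaryOperator<String> normalizer,
                        UnaryOperator<String> urlFilter) {
        if (url == null) {
            return null;
        }
        String result = url;
        if (normalize) {
            result = normalizer.apply(result);
            if (result == null) {
                return null;
            }
        }
        if (filter) {
            result = urlFilter.apply(result);
        }
        return result;
    }

    // Applies clean to every URL and keeps only the ones that survive.
    static List<String> cleanAll(List<String> urls, boolean normalize, boolean filter) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            String cleaned = clean(url, normalize, filter,
                    WebGraphUrlSketch::stripFragment,
                    WebGraphUrlSketch::rejectPrivate);
            if (cleaned != null) {
                kept.add(cleaned);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> links = List.of(
                "http://example.org/a#top",
                "http://example.org/private/b");
        // Both steps enabled, then both disabled (input passes through as-is).
        System.out.println(cleanAll(links, true, true));
        System.out.println(cleanAll(links, false, false));
    }
}
```

With both switches off the input passes through untouched, which corresponds
to the case where the segments were already normalized and filtered during
parsing. Turning the filter on lets a URL that was removed from the CrawlDB
be dropped from the graph as well.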