[ https://issues.apache.org/jira/browse/NUTCH-363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma closed NUTCH-363.
-------------------------------
Bulk close of resolved issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira
> Fetcher normalizes everything at least twice
> --------------------------------------------
>
> Key: NUTCH-363
> URL: https://issues.apache.org/jira/browse/NUTCH-363
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 0.8
> Environment: OS X 10.4.7
> Reporter: Doug Cook
> Priority: Minor
> Fix For: 2.0
>
>
> New links are normalized twice by the fetcher:
> First in DOMContentUtils.getOutlinks, where the constructor
> Outlink(url.toString(), linkText.toString().trim(), conf) normalizes the URL.
> The second time is in ParseOutputFormat.write().
> For some URLs (e.g. those repeated on a page) a given URL may be normalized a
> number of times, but it is always normalized at least twice.
> For those of us with expensive normalizations, this is probably burning some
> CPU.
> I'd gladly fix this, but I'm not yet familiar enough with the code to know if
> there are some hidden assumptions which rely on this behavior.
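
One way to bound the cost described above, without deciding which of the two call sites to remove, is to memoize the normalizer. The following is a minimal sketch, not Nutch code: MemoizingNormalizer and its delegate are hypothetical names, the normalizer is modelled as a plain String-to-String function, and the sketch assumes normalization is idempotent (normalizing an already normalized URL returns it unchanged).

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical wrapper; not part of Nutch.
class MemoizingNormalizer {
    private final UnaryOperator<String> delegate; // the expensive normalizer
    private final Map<String, String> cache = new HashMap<>();
    private int delegateCalls = 0;

    MemoizingNormalizer(UnaryOperator<String> delegate) {
        this.delegate = delegate;
    }

    String normalize(String url) {
        String cached = cache.get(url);
        if (cached != null) {
            return cached; // repeated link, or second pass over a known URL
        }
        delegateCalls++;
        String normalized = delegate.apply(url);
        if (normalized != null) {
            cache.put(url, normalized);
            // Assumption: normalization is idempotent, so the normalized
            // form maps to itself and a second pass becomes a cache hit.
            cache.put(normalized, normalized);
        }
        return normalized;
    }

    int delegateCalls() {
        return delegateCalls;
    }

    public static void main(String[] args) {
        // Lower-casing stands in for an expensive normalization.
        MemoizingNormalizer n =
            new MemoizingNormalizer(u -> u.toLowerCase(Locale.ROOT));
        n.normalize("HTTP://Example.com/A"); // first pass (parser side)
        n.normalize("http://example.com/a"); // second pass (output side)
        n.normalize("HTTP://Example.com/A"); // same link repeated on the page
        System.out.println(n.delegateCalls()); // prints 1
    }
}
```

The cache here is unbounded and not thread-safe; a real fetcher would presumably need a per-page or size-limited cache.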
> [A related note is that URLs are normalized *before* filtering; this is
> causing a lot of extra normalization as well. In general, filters may not be
> safe to run before normalization, but there is likely a class of them which
> are (filtering out .gif/.jpg etc). Perhaps the notion of a "pre-normalizer
> filter" would be a useful one?]
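
The "pre-normalizer filter" idea in the note above could look roughly like the sketch below. It is an illustration only: PreNormalizerFilterSketch, passesPreFilter and process are hypothetical names, and the normalizer is again modelled as a plain function rather than any Nutch class. The sketch assumes that a suffix test such as .gif/.jpg gives the same answer before and after normalization, which is the property that makes it safe to run early.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical sketch; not part of Nutch.
class PreNormalizerFilterSketch {
    // Suffixes assumed to be rejected regardless of normalization.
    private static final List<String> SKIPPED_SUFFIXES =
        Arrays.asList(".gif", ".jpg");

    // Cheap check that runs on the raw, not yet normalized URL.
    static boolean passesPreFilter(String rawUrl) {
        String lower = rawUrl.toLowerCase(Locale.ROOT);
        for (String suffix : SKIPPED_SUFFIXES) {
            if (lower.endsWith(suffix)) {
                return false;
            }
        }
        return true;
    }

    // Normalizes only the URLs that survive the pre-filter.
    static List<String> process(List<String> rawUrls,
                                UnaryOperator<String> normalizer) {
        List<String> result = new ArrayList<>();
        for (String raw : rawUrls) {
            if (!passesPreFilter(raw)) {
                continue; // rejected without paying for normalization
            }
            String normalized = normalizer.apply(raw);
            if (normalized != null) {
                result.add(normalized);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList(
            "http://example.com/logo.GIF",
            "http://example.com/Page",
            "http://example.com/photo.jpg");
        // Lower-casing stands in for an expensive normalization.
        System.out.println(process(raw, u -> u.toLowerCase(Locale.ROOT)));
    }
}
```

Filters that are not normalization-safe would still have to run after the normalizer, so this adds a stage rather than replacing the existing one.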