> From: Fuad Efendi
> I already posted here that URL Normalizer is called after extracting
> Outlinks from a Page.
I was _wrong_, sorry. The Injector also normalizes each URL before filtering it.
Code from Injector:
try {
  url = urlNormalizers.normalize(url, URLNormalizers.SCOPE_INJECT);
  url = filters.filter(url); // filter the url
} catch (Exception e) {
  ...
}
You have to ensure that Nutch uses the proper config file (the one with the
correct normalizer).
A pattern handed to Perl5Compiler from Java code should use the escaped \\s
instead of \s; I am not sure whether a literal whitespace character can be
used inside an XML node.
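A minimal sketch of the escaping point. It uses java.util.regex rather than ORO's Perl5Compiler so that it runs without extra jars, and the class name, method name, and the %20 rule are only my illustration, not anything taken from Nutch:

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Nutch.
class WhitespaceEscapeDemo {
    // In Java source the backslash itself must be escaped, so the
    // regex \s is written as the string literal "\\s" (two characters).
    static final Pattern WHITESPACE = Pattern.compile("\\s");

    // Illustrative rule only: replace every whitespace character with %20.
    static String encodeSpaces(String url) {
        return WHITESPACE.matcher(url).replaceAll("%20");
    }

    public static void main(String[] args) {
        System.out.println(encodeSpaces("http://example.com/a b"));
    }
}
```

The doubling is needed only because the pattern sits inside a Java string literal.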
P.S.
Some "normalizers" in Nutch are synchronized singletons, so you will have an
obvious performance bottleneck when several threads normalize URLs at once.
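
To show what I mean by the bottleneck, here is a rough sketch with two hypothetical classes (neither is a real Nutch class, and the whitespace rule is made up). The first shares one instance and locks on every call. The second relies on java.util.regex.Pattern being immutable and safe to share, and creates a fresh Matcher per call, so no lock is needed:

```java
import java.util.regex.Pattern;

// Hypothetical: every thread calling normalize() waits on the same monitor.
class SynchronizedNormalizer {
    private static final SynchronizedNormalizer INSTANCE = new SynchronizedNormalizer();
    private final Pattern spaces = Pattern.compile("\\s+");

    static SynchronizedNormalizer getInstance() {
        return INSTANCE;
    }

    synchronized String normalize(String url) {
        return spaces.matcher(url.trim()).replaceAll("%20");
    }
}

// Hypothetical alternative: the compiled Pattern is shared, but each call
// gets its own Matcher, so threads do not block one another.
class LockFreeNormalizer {
    private static final Pattern SPACES = Pattern.compile("\\s+");

    static String normalize(String url) {
        return SPACES.matcher(url.trim()).replaceAll("%20");
    }
}
```

Both return the same result for the same input; the difference only shows up under concurrent load.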