I crawled some internal sites and found that URLs containing '<' and '>'
characters are fetched and indexed, even though these are usually just bad
links. I'd like Nutch to throw a malformed URL error for them, as it does
for '[', whitespace, and some other characters. I know I can escape '<' and
'>' in the regex-urlfilter.txt file, but I do want to know that these bad
links exist, which is why I'd like Nutch to treat them as malformed URLs.
Is there a conf file I am missing, or do I need to tweak the code?
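For reference, here is the kind of rule I mean for the filter approach: a
hypothetical reject rule for conf/regex-urlfilter.txt that drops any URL
containing '<' or '>' (it would need to go before the final
accept-everything rule), though this silently filters the links rather than
flagging them as malformed:

```
# Hypothetical rule: reject URLs containing '<' or '>'
-[<>]
```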

A related question: if I want certain characters in the URL encoded
(such as '<' to %3C), what is the best approach?  Thanks!
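To illustrate the encoding I have in mind, here is a minimal Java sketch
(the URL is made up for the example) that percent-encodes just '<' and '>'
while leaving the rest of the URL untouched; I'm not sure whether Nutch's
URL normalizer plugins can be configured to do this, hence the question:

```java
public class EncodeDemo {
    public static void main(String[] args) {
        // Hypothetical bad link containing '<' and '>'
        String url = "http://example.com/page<1>.html";
        // Percent-encode only the angle brackets
        String encoded = url.replace("<", "%3C").replace(">", "%3E");
        System.out.println(encoded); // http://example.com/page%3C1%3E.html
    }
}
```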
