While crawling some internal sites, I noticed that URLs containing '<' and '>' characters are fetched and indexed, even though these are usually just bad links. I'd like Nutch to throw a malformed URL error for them, as it does for '[', whitespace, and some other characters. I know I could escape '<' and '>' in regex-urlfilter.txt, but I want to know that these bad links exist, which is why I'd prefer Nutch to treat them as malformed URLs. Is there a conf file I'm missing, or do I need to tweak the code?
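(For reference, the filter-based workaround I mentioned would look roughly like this in regex-urlfilter.txt — just a sketch assuming the stock character-class rule shipped with Nutch, which rejects URLs whose path contains certain characters:)

```
# skip URLs containing certain characters as probable queries, etc.
# (default rule, with '<' and '>' added to the character class)
-[?*!@=<>]
```

But as noted above, this silently drops the URLs rather than surfacing them as malformed, so it's not quite what I want.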
A related question: if I want certain characters in the URL percent-encoded (such as '<' to %3C), what is the best approach? Thanks!
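P.S. To clarify what I mean by encoding, here is a plain-Java sketch of the transformation I'm after (this is not a Nutch API — `encodeAngleBrackets` is a hypothetical helper; I'm asking where such logic would best live, e.g. in a URL normalizer):

```java
// Sketch: percent-encode only specific characters ('<' and '>')
// in a URL string, leaving everything else untouched.
public class EncodeDemo {

    // Hypothetical helper, not part of Nutch.
    static String encodeAngleBrackets(String url) {
        return url.replace("<", "%3C").replace(">", "%3E");
    }

    public static void main(String[] args) {
        // prints http://host/path%3Ca%3E
        System.out.println(encodeAngleBrackets("http://host/path<a>"));
    }
}
```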
