I crawled some internal sites and found that URLs containing '<' and '>'
characters are fetched and indexed, even though these are usually just bad
links. I'd like Nutch to throw a malformed URL error for them, as it does
for '[', whitespace, and some other characters. I know I can escape '<' and
'>' in the regex-urlfilter.txt file, but I do want to know that these bad
links exist, which is why I'd like Nutch to treat them as malformed URLs.
Is there a conf file I am missing, or do I need to tweak the code?
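For reference, here is the kind of rule I mean for the filter approach: a
hypothetical reject rule for conf/regex-urlfilter.txt that drops any URL
containing '<' or '>' (it would need to go before the final
accept-everything rule), though this silently filters the links rather than
flagging them as malformed:

```
# Hypothetical rule: reject URLs containing '<' or '>'
-[<>]
```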

A related question: if I want certain characters in the URL encoded
(such as '<' to %3C), what is the best approach?  Thanks!
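To illustrate the encoding I have in mind, here is a minimal Java sketch
(the URL is made up for the example) that percent-encodes just '<' and '>'
while leaving the rest of the URL untouched; I'm not sure whether Nutch's
URL normalizer plugins can be configured to do this, hence the question:

```java
public class EncodeDemo {
    public static void main(String[] args) {
        // Hypothetical bad link containing '<' and '>'
        String url = "http://example.com/page<1>.html";
        // Percent-encode only the angle brackets
        String encoded = url.replace("<", "%3C").replace(">", "%3E");
        System.out.println(encoded); // http://example.com/page%3C1%3E.html
    }
}
```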
