In my browser, I can see a link whose URL contains spaces. When I hover over it, the browser shows the spaces replaced with %20, and when I click on it I get the document. Nutch, however, does not encode the spaces when it follows the link, so it gets a 404. It should do the same thing web browsers do, or else I'm going to be facing questions from my users about why certain documents aren't indexed even though they can see them just fine.
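As a minimal sketch of the conversion described above (this is not Nutch code; the class name `SpaceEncoder` and method `encodeSpaces` are hypothetical, and it assumes that only literal spaces need escaping), the browser-like behaviour might look something like this:

```java
// Hypothetical helper, not part of Nutch: mimics the browser behaviour of
// replacing literal spaces in an href with %20 before the URL is fetched.
public final class SpaceEncoder {
    private SpaceEncoder() {}

    /**
     * Trims surrounding whitespace and replaces each remaining space with %20.
     * All other characters, including existing %XX escapes, are left untouched.
     */
    public static String encodeSpaces(String rawUrl) {
        return rawUrl.trim().replace(" ", "%20");
    }

    public static void main(String[] args) {
        String href = "http://localhost/Documents/pharma/DocSamples/Leg blood clots.htm";
        System.out.println(encodeSpaces(href));
    }
}
```

A plain string replacement is used here on purpose: it leaves URLs that are already escaped unchanged, so they are not encoded a second time. It makes no attempt to handle other characters that are illegal in URLs.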
If I do a view source, I can see the URLs with spaces in them:

<a href="http://localhost/Documents/pharma/DocSamples/Leg blood clots.htm">Leg blood clots.htm</a><br />

But when I click on one of them, the URL is converted to:

http://localhost/Documents/pharma/DocSamples/Leg%20blood%20clots.htm

-- 
http://www.linkedin.com/in/paultomblin
