In my browser, I can see a link whose URL contains spaces. When I
hover over it, the browser shows the spaces replaced with %20, and
when I click on it I get the document. Nutch, however, does not do
that replacement when it follows the link, so it gets a 404. It should
do the same thing web browsers do; otherwise I'm going to be facing
questions from my users about why certain documents aren't indexed
even though they can see them just fine.

If I view the page source, I can see the URLs with spaces in them:
<a href="http://localhost/Documents/pharma/DocSamples/Leg blood clots.htm">Leg blood clots.htm</a><br />

But when I click on one of them, the URL gets converted to:
http://localhost/Documents/pharma/DocSamples/Leg%20blood%20clots.htm
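To make the expected behavior concrete, here is a minimal sketch of the
conversion I'm describing. It is not Nutch code: the class and method
names (SpaceEncoder, encodeSpaces) are made up for illustration, and it
only handles literal spaces, not the other characters a browser might
escape. Where something like this would hook into Nutch is not shown.

```java
// Hypothetical helper, not part of Nutch: mimics the one thing the
// browser is doing in the example above (space -> %20).
final class SpaceEncoder {

    // Trims surrounding whitespace, then replaces each remaining
    // literal space with %20. Existing %20 sequences are left alone,
    // so running it twice gives the same result.
    static String encodeSpaces(String rawUrl) {
        return rawUrl.trim().replace(" ", "%20");
    }

    public static void main(String[] args) {
        String raw =
            "http://localhost/Documents/pharma/DocSamples/Leg blood clots.htm";
        System.out.println(encodeSpaces(raw));
    }
}
```

With the href from the page source above as input, this yields the same
URL the browser shows when I click the link.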


-- 
http://www.linkedin.com/in/paultomblin