I don't think this is a bug. Modern browsers handle many HTML errors on a "best guess" basis, but that does not mean their guess is always correct. For example, <p><a>my wonderful</p>link text</a> is a very common mistake. As for the URL: yes, I am using some modified pieces of Nutch, with some preprocessing to guess the correct URL.
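To illustrate the kind of URL preprocessing I mean, here is a minimal sketch in Python. It is not the Nutch code (Nutch is Java), and the function name escape_url is hypothetical. It percent-encodes spaces and other unsafe characters in the path and query while leaving existing %XX escapes alone, which is roughly what a browser does with a link such as the "Leg blood clots.htm" one from the original message.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters that may stay unescaped in a path or query.
# "%" is kept so that already-encoded sequences are not encoded twice.
_SAFE_PATH = "/%:@!$&'()*+,;=~"
_SAFE_QUERY = _SAFE_PATH + "?"


def escape_url(url: str) -> str:
    """Percent-encode spaces and other unsafe characters in a URL.

    Hypothetical helper for illustration only; scheme, host and
    fragment are passed through unchanged.
    """
    parts = urlsplit(url.strip())
    path = quote(parts.path, safe=_SAFE_PATH)
    query = quote(parts.query, safe=_SAFE_QUERY)
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


if __name__ == "__main__":
    raw = "http://localhost/Documents/pharma/DocSamples/Leg blood clots.htm"
    print(escape_url(raw))
```

Keeping "%" in the safe set is a deliberate trade-off: it avoids double-encoding links that are already correct, at the cost of passing through a stray literal "%" in a malformed link. This is still only a guess at what the webmaster intended.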
Certain documents aren't indexed due to Webmaster mistakes.

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Paul Tomblin
Sent: August-25-09 3:28 PM
To: nutch-user
Subject: Nutch bug: can't handle urls with spaces in them

In my browser, I can see a URL with spaces in it, but when I hover over it, the browser has replaced the spaces with %20s, and when I click on it I get the document. However, when Nutch attempts to follow the link, it doesn't do that, and so it gets a 404. It should do the same thing that web browsers do, or else I'm going to be facing questions from my users about why certain documents aren't indexed even though they can see them just fine.

If I do a view source, I can see the URLs with spaces in them:

<a href="http://localhost/Documents/pharma/DocSamples/Leg blood clots.htm">Leg blood clots.htm</a><br />

But when I click on them, the URL got converted to:

http://localhost/Documents/pharma/DocSamples/Leg%20blood%20clots.htm

--
http://www.linkedin.com/in/paultomblin
