I don't think this is a bug. Modern browsers handle many HTML errors on a
"best guess" basis, but that does not mean their guess is always
correct. For example, <p><a>my wonderful</p>link text</a> is a very common
mistake. As for the URL: yes, I am using some modified pieces of Nutch, with
some preprocessing to guess the correct URL.
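
To illustrate the kind of preprocessing meant here, below is a minimal
sketch in Java. It is not the code I actually run and it is not part of
Nutch; the class name UrlSpaceFixer and the method encodeSpaces are
hypothetical. It assumes the only problem in the href is literal spaces,
and that any escapes already present (such as an existing %20) should be
left untouched.

```java
public class UrlSpaceFixer {

    /**
     * Hypothetical helper (not a Nutch API): strips leading and trailing
     * whitespace from an href and percent-encodes each remaining space
     * as %20, roughly what a browser does before requesting the link.
     * Existing escapes are left alone because only the space character
     * is replaced.
     */
    public static String encodeSpaces(String href) {
        if (href == null) {
            return null;
        }
        return href.trim().replace(" ", "%20");
    }

    public static void main(String[] args) {
        String raw =
            "http://localhost/Documents/pharma/DocSamples/Leg blood clots.htm";
        // prints http://localhost/Documents/pharma/DocSamples/Leg%20blood%20clots.htm
        System.out.println(encodeSpaces(raw));
    }
}
```

A plain string replacement is used rather than rebuilding the address with
the multi-argument java.net.URI constructors, because those constructors
always quote the percent character and would turn an already-encoded %20
into %2520.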


Certain documents aren't indexed due to Webmaster mistakes.



-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Paul
Tomblin
Sent: August-25-09 3:28 PM
To: nutch-user
Subject: Nutch bug: can't handle urls with spaces in them

In my browser, I can see a URL with spaces in it, but when I hover
over it, the browser has replaced the spaces with %20s, and when I
click on it I get the document.  However, when Nutch attempts to
follow the link, it doesn't do that, and so it gets a 404.  It should
do the same thing that web browsers do, or else I'm going to be facing
questions from my users about why certain documents aren't indexed
even though they can see them just fine.

If I do a view source, I can see the URLs with spaces in them:
<a href="http://localhost/Documents/pharma/DocSamples/Leg blood
clots.htm">Leg blood clots.htm</a><br />

But when I click on them, the URL got converted to:
http://localhost/Documents/pharma/DocSamples/Leg%20blood%20clots.htm


-- 
http://www.linkedin.com/in/paultomblin

