Erik Hatcher wrote:
> Please reply to me directly as well, as I'm not on the nutch-dev list 
> regularly.
> 
> I'm curious ... Googlebot, Yahoo Slurp, and now CazoodleBot (based on 

Googlebot and Slurp, too?? Hey, we're in pretty good company! ;)

> Nutch) are hitting our site at http://www.nines.org and I get all sorts 
> of invalid links crawled.  Is our site doing something wrong in our 
> markup?  Or are all these crawlers flawed by hitting non-sensible URLs?

This is a side effect of an unsophisticated method for extracting URLs 
from JavaScript files: the extractor takes "likely" strings and tries 
to build absolute URLs out of them. In many cases the strings have 
nothing to do with URLs.
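
To illustrate the effect, here is a minimal sketch in Python. It is not 
the actual Nutch code; the function name, the regular expressions and the 
"looks like a URL" rule are all assumptions made up for this example. It 
only shows how resolving every plausible-looking string literal against 
the page's base URL can produce links that do not exist on the site.

```python
import re
from urllib.parse import urljoin

# Hypothetical heuristic, NOT the one Nutch uses: pull out every quoted
# string literal, then keep those made only of URL-ish characters that
# contain a slash or a dot.
STRING_LITERAL = re.compile(r"""(["'])((?:(?!\1).)*)\1""")
URLISH_CHARS = re.compile(r"^[\w./?&=%-]+$")


def extract_candidate_urls(js_source, base_url):
    """Return absolute URLs built from 'likely' strings in js_source."""
    candidates = []
    for match in STRING_LITERAL.finditer(js_source):
        text = match.group(2)
        if ("/" in text or "." in text) and URLISH_CHARS.match(text):
            # Resolve the string against the base URL as if it were a link.
            candidates.append(urljoin(base_url, text))
    return candidates


# Only the first literal is a real link; the MIME type and the version
# string are picked up as well because they "look like" paths.
js = 'var img = "images/logo.png"; var t = "text/javascript"; var v = "1.0";'
found = extract_candidate_urls(js, "http://www.nines.org/")
for url in found:
    print(url)
```

With this sample input, only the first of the three resulting URLs points 
at something the script really references; the other two are the kind of 
nonsensical links that show up in the server logs.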

Please see NUTCH-505 for more details.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
