Erik Hatcher wrote:
Please reply to me directly as well, as I'm not on the nutch-dev list
regularly.
I'm curious ... Googlebot, Yahoo Slurp, and now CazoodleBot (based on
Nutch) are hitting our site at http://www.nines.org and I get all sorts
of invalid links crawled. Is our site doing something wrong in our
markup? Or are all these crawlers flawed by hitting non-sensible URLs?
Googlebot and Slurp, too ?? Hey, we're in a pretty good company! ;)
This is a side-effect of an unsophisticated method for extracting URLs
from JavaScript files - the extractor takes "likely" strings and tries
to build absolute URLs out of them. In many cases the strings have
nothing to do with URLs.
Please see NUTCH-505 for more details.
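
To illustrate the kind of false positive described above, here is a rough
sketch in Python. It is not Nutch's actual extractor; the regular
expression, the "contains a slash or a dot" test, and the function name
extract_candidate_urls are all assumptions made up for this example. It
only shows how a "likely string" heuristic can turn ordinary JavaScript
string literals into absolute URLs that do not exist on the site.

```python
import re
from urllib.parse import urljoin

# Assumed heuristic (hypothetical, not taken from Nutch): any quoted
# string literal without whitespace is a candidate.
STRING_LITERAL = re.compile(r"""(["'])([^"'\s]+)\1""")


def extract_candidate_urls(js_source, base_url):
    """Return absolute URLs built from 'likely' string literals."""
    urls = []
    for match in STRING_LITERAL.finditer(js_source):
        literal = match.group(2)
        # A literal "looks like" a link if it contains a slash or a dot.
        if "/" in literal or "." in literal:
            # Resolve the literal against the page URL, as a crawler would.
            urls.append(urljoin(base_url, literal))
    return urls


if __name__ == "__main__":
    # Made-up script content: one real relative link, two non-links.
    js = '''
    var img = "images/logo.png";
    var type = "text/javascript";
    var fmt = "mm/dd/yyyy";
    '''
    for url in extract_candidate_urls(js, "http://www.nines.org/"):
        print(url)
```

With this sketch, only the first literal is a real link, yet
"text/javascript" and "mm/dd/yyyy" are also resolved against the base URL
and would be queued for fetching, which is how nonsensical requests can
end up in a server log.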
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com