:get}"

Andrzej Bialecki Tue, 10 Jul 2007 06:37:34 -0700

Erik Hatcher wrote:

Please reply to me directly as well, as I'm not on the nutch-dev list regularly.
I'm curious ... Googlebot, Yahoo Slurp, and now CazoodleBot (based on


Googlebot and Slurp, too ?? Hey, we're in a pretty good company! ;)

Nutch) are hitting our site at http://www.nines.org and I get all sorts of invalid links crawled. Is our site doing something wrong in our markup? Or are all these crawlers flawed by hitting non-sensible URLs?

This is a side-effect of an unsophisticated method for extracting URLs from Javascript files - the extractor takes "likely" strings and tries to build absolute URLs out of them. In many cases the strings have nothing to do with URLs.


Please see NUTCH-505 for more details.
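
To illustrate the kind of heuristic described above, here is a minimal sketch in Python. It is not the actual Nutch code; the function names, the "likely URL" test, the sample script and the example.org base URL are all assumptions made up for this example. It only shows how resolving every plausible-looking string literal against the page URL can produce links that were never meant to be links.

```python
import re
from urllib.parse import urljoin

# Matches single- or double-quoted string literals (no escape handling).
STRING_LITERAL = re.compile(r"""(["'])(.*?)\1""")


def looks_like_url(s):
    """Naive guess (assumption): a slash or a file-like extension means 'URL'."""
    return "/" in s or re.search(r"\.(html?|js|php|css)$", s) is not None


def extract_candidate_urls(js_source, base_url):
    """Resolve every 'likely' string literal against the base URL."""
    candidates = []
    for _quote, literal in STRING_LITERAL.findall(js_source):
        if looks_like_url(literal):
            candidates.append(urljoin(base_url, literal))
    return candidates


# Hypothetical script: only the first literal is really a link.
js = '''
var logo = "/images/logo.png";
var mime = "text/javascript";
var greeting = "hello";
'''

for url in extract_candidate_urls(js, "http://www.example.org/nines/"):
    print(url)
# http://www.example.org/images/logo.png
# http://www.example.org/nines/text/javascript
```

The second result is a MIME type that happens to contain a slash, yet a crawler following this approach would request it, and the server would answer with an error much like the routing error in the subject of this thread.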


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: Fwd: [Collex] application#index (ActionController::RoutingError) "no route found to match \"/nines/ escape(document.title) u,\" with {:method=>:get}"
