Erik Hatcher wrote:
> Please reply to me directly as well, as I'm not on the nutch-dev list
> regularly.
>
> I'm curious ... Googlebot, Yahoo Slurp, and now CazoodleBot (based on

Googlebot and Slurp, too? Hey, we're in pretty good company! ;)

> Nutch) are hitting our site at http://www.nines.org and I get all sorts
> of invalid links crawled. Is our site doing something wrong in our
> markup? Or are all these crawlers flawed by hitting non-sensible URLs?

This is a side effect of an unsophisticated method for extracting URLs from JavaScript files - the extractor takes "likely" strings and tries to build absolute URLs out of them. In many cases the strings have nothing to do with URLs. Please see NUTCH-505 for more details.

-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
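
For illustration only, here is a rough sketch of the kind of heuristic described above: pull quoted strings out of a script, keep the ones that look "likely", and resolve them against the script's URL. This is not the Nutch code; the function names, the regular expression, the "likely" rule, and the sample script and base URL are all assumptions made up for this example.

```python
import re
from urllib.parse import urljoin

# Matches simple single- or double-quoted string literals (no escapes handled).
STRING_LITERAL = re.compile(r"""(?:"([^"\\\n]*)"|'([^'\\\n]*)')""")


def looks_like_url(s):
    """Hypothetical 'likely URL' rule: no whitespace, and either a slash
    or a short file-extension-like suffix."""
    if not s or any(c.isspace() for c in s):
        return False
    return "/" in s or re.search(r"\.[A-Za-z]{2,4}$", s) is not None


def extract_candidate_urls(js_source, base_url):
    """Return absolute URLs built from every 'likely' string literal."""
    urls = []
    for m in STRING_LITERAL.finditer(js_source):
        literal = m.group(1) if m.group(1) is not None else m.group(2)
        if looks_like_url(literal):
            urls.append(urljoin(base_url, literal))
    return urls


# Made-up sample script: only the first literal is a real path.
SAMPLE_JS = "\n".join([
    'var img = "images/logo.png";',
    "var mime = 'text/javascript';",
    'var ratio = "3/4";',
    'var greeting = "hello world";',
])

# Hypothetical location of the script on the crawled site.
BASE_URL = "http://www.nines.org/js/site.js"

if __name__ == "__main__":
    for url in extract_candidate_urls(SAMPLE_JS, BASE_URL):
        print(url)
```

With this sample, the MIME type and the fraction both pass the "likely" test and come out as absolute URLs next to the one genuine image path, which is the sort of non-sensible link a site owner would then see in the access logs.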
