Hi Nutch Team,

before I permanently reject Nutch from all my sites, I had better tell you why: your URL parser is extremely faulty and creates a lot of trouble.
Here is an example. Suppose a page at http://www.somesite/somepage/ contains a link that looks like this in HTML: <a href=".">This Page</a>. The parser should recognize that the "." (dot) refers to http://www.somesite/somepage/ and not to http://www.somesite/somepage/. (with a literal trailing dot). Every single browser does this correctly, so why not Nutch?

Why is this important? Many new sites no longer map URLs to directories in the traditional way, but instead encode controllers, actions, parameters, etc. in the URL. These are split by a separator, which is often "/" (slash), so a URL with a trailing dot requests a different resource than one without it. Ignoring the dot in the backend to cope with Nutch's faulty parser would create at least two URLs serving the same content, which in turn might affect your Google ranking.

Also, Nutch parses "compressed" JavaScript files, which are written as one long line, then somehow takes part of the code and appends it to the URL, creating a huge number of 404s on the server side. For example, given a URL to a JavaScript file such as http://www.somesite/javascript/foo.js, Nutch parses it and then requests random (?) new URLs that look like http://www.somesite/javascript/someFunction(); and so on.

Please, please, please fix Nutch!

Thanks,
Juergen

--
Shakodo - The road to profitable photography: http://www.shakodo.com/
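
For reference, the resolution of the "." link that browsers perform can be sketched with Python's standard library. This is only an illustration of the expected behaviour under RFC 3986 style relative-reference resolution, not Nutch's code; the helper name resolve_link is made up for this example.

```python
from urllib.parse import urljoin


def resolve_link(base_url: str, href: str) -> str:
    """Resolve an href found on the page at base_url to an absolute URL.

    resolve_link is a hypothetical helper for illustration; it simply
    delegates to urljoin, which removes "." and ".." path segments.
    """
    return urljoin(base_url, href)


# The page and link from the example in this message.
page = "http://www.somesite/somepage/"
resolved = resolve_link(page, ".")

print(resolved)  # http://www.somesite/somepage/

# The URL with a literal trailing dot is a different string, and on a
# controller/action style site it can address a different resource.
print(resolved == page + ".")  # False
```

A crawler that resolves links this way requests the same URL a browser would, so the server never sees the variant with the trailing dot.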

