Has anyone looked into this? This is especially a problem when folks like Juergen are a customer and, quite rightly, raise hell. I wasn't aware of this, since Nutch is a software metaphor for a firehose. But what I have noticed is that the URL parser is really, really terrible. Expletive-worthy.
The problem I am experiencing is the lack of subdomain support. Dumping thousands of regexes into a flatfile is a terrible hack. More than that, pushing metadata down through a given site becomes unreliable. If one site links to another, and that site's links are crawled, your metadata is now unreliable. And so on. I don't want to come across as whiny, but I just did. I really think Nutch needs to hunker down on tests. I'm guilty of not caring about it myself, but it's because testing Java is pretty painful compared to BDD tools like RSpec: http://www.codecommit.com/blog/java/the-brilliance-of-bdd

Scott

On Fri, Feb 25, 2011 at 4:08 PM, Juergen Specht <[email protected]> wrote:
> Hi Nutch Team,
>
> Before I permanently reject Nutch from all my sites, I'd better tell
> you why: your URL parser is extremely faulty and creates a lot of
> trouble.
>
> Here is an example. If you have a link on a page, say:
>
> http://www.somesite/somepage/
>
> and the link in HTML looks like:
>
> <a href=".">This Page</a>
>
> the parser should identify that the "." (dot) refers to this URL:
>
> http://www.somesite/somepage/
>
> and not to:
>
> http://www.somesite/somepage/.
>
> Every single browser does this correctly; why not Nutch?
>
> Why is this important? Many new sites no longer use the traditional
> mapping of directories in the URL, but instead encode controllers,
> actions, parameters, etc. in the URL.
>
> These get split by a separator, which is often "/" (slash), so a URL
> with a trailing dot requests a different resource than one without it.
> Ignoring the dot in the backend to cope with Nutch's faulty parser
> would create at least two URLs serving the same content, which in turn
> might affect your Google ranking.
>
> Also, Nutch parses "compressed" JavaScript files, which are all
> written in one long line, then somehow takes part of the code and
> adds it to the URL, creating a huge array of 404s on the server
> side.
>
> For example, you have a URL to a JavaScript file like this:
>
> http://www.somesite/javascript/foo.js
>
> Nutch parses this and then accesses random (?) new URLs which look like:
>
> http://www.somesite/javascript/someFunction();
>
> etc. etc.
>
> Please, please, please fix Nutch!
>
> Thanks,
>
> Juergen
> --
> Shakodo - The road to profitable photography: http://www.shakodo.com/
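For what it's worth, the behavior Juergen expects is exactly relative-reference resolution as specified in RFC 3986 (and its predecessor RFC 2396), which the JDK's own java.net.URI already implements. A minimal sketch, using the URLs from his example (the class name is illustrative):

```java
import java.net.URI;

public class DotResolution {
    public static void main(String[] args) {
        URI base = URI.create("http://www.somesite/somepage/");

        // Per RFC 3986, a "." reference resolves to the base directory itself...
        URI resolved = base.resolve(".");
        System.out.println(resolved);  // http://www.somesite/somepage/

        // ...not to a literal trailing-dot URL, which is a distinct resource.
        System.out.println(resolved.toString().equals("http://www.somesite/somepage/."));  // false
    }
}
```

So a crawler could lean on this (or an equivalent resolver) rather than treating the href text as a path segment to append.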

