Has anyone looked into this?  This is especially a problem when folks
like Juergen are customers and, quite rightfully, raise hell.  I
wasn't aware of this, since Nutch is a software metaphor for a
firehose.  But what I have noticed is that the URL Parser is really,
really terrible.  Expletive-worthy.

The problem I am experiencing is the lack of subdomain support.
Dumping thousands of regexes into a flatfile is a terrible hack.  More
than that, pushing metadata down through a given site becomes
unreliable.  If one site links to another, and that site's links are
crawled, your metadata is now polluted too.
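
For what it's worth, a host-suffix check would replace those thousands
of regex lines with one comparison per allowed domain.  This is just a
sketch of the idea, not Nutch's actual filter-plugin API; the
HostFilter class and allowedDomains set are hypothetical names:

```java
import java.net.URI;
import java.util.Set;

public class HostFilter {
    // Hypothetical allow-list of registered domains (not Nutch's API).
    private final Set<String> allowedDomains;

    public HostFilter(Set<String> allowedDomains) {
        this.allowedDomains = allowedDomains;
    }

    // Accept a URL if its host equals an allowed domain or is a
    // subdomain of one -- one suffix check instead of a regex per host.
    public boolean accepts(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return false;
        }
        for (String domain : allowedDomains) {
            if (host.equals(domain) || host.endsWith("." + domain)) {
                return true;
            }
        }
        return false;
    }
}
```

With allowedDomains = {"example.com"}, this accepts both
http://example.com/ and http://www.example.com/page while rejecting
unrelated hosts, which is exactly the subdomain behavior the regex
flatfile fails to express cleanly.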

Etc.  I don't want to come across as whiny, but I just did.  I really
think Nutch needs to hunker down on tests.  I'm guilty of not caring
about it myself, but it's because testing Java is pretty painful
compared to BDD tools like RSpec:

http://www.codecommit.com/blog/java/the-brilliance-of-bdd
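
Even without a BDD framework, the "." resolution bug Juergen describes
below can be pinned down with plain Java, since java.net.URI implements
RFC 3986 relative-reference resolution.  A minimal check:

```java
import java.net.URI;

public class DotResolutionCheck {
    public static void main(String[] args) {
        URI base = URI.create("http://www.somesite/somepage/");
        // Per RFC 3986, "." resolves to the base directory,
        // not to a URL with a literal trailing dot.
        URI resolved = base.resolve(".");
        assert resolved.toString().equals("http://www.somesite/somepage/")
                : "unexpected resolution: " + resolved;
        System.out.println(resolved);  // prints http://www.somesite/somepage/
    }
}
```

If Nutch's parser produced "http://www.somesite/somepage/." here, a
test like this would have caught it immediately.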

Scott

On Fri, Feb 25, 2011 at 4:08 PM, Juergen Specht
<[email protected]> wrote:
> Hi Nutch Team,
>
> before I permanently reject Nutch from all my sites, I better tell
> you why...your URL parser is extremely faulty and creates a lot of
> trouble.
>
> Here is an example, if you have a link on a page, say:
>
> http://www.somesite/somepage/
>
> and the link in HTML looks like:
>
> <a href=".">This Page</a>
>
> the parser should identify that the "." (dot) refers
> to this URL:
>
> http://www.somesite/somepage/
>
> and not to:
>
> http://www.somesite/somepage/.
>
> Every single browser does this correctly, so why doesn't Nutch?
>
> Why is this important? Many new sites don't use the traditional
> mapping of directories from the URL model anymore, but instead
> have controllers, actions, parameters etc. encoded in the URL.
>
> They get split by a separator, which often is "/" (slash), so if
> there is a trailing dot, it requests a different resource than
> without the dot. Ignoring the dot in the backend to cope with
> Nutch's faulty parser would create at least two URLs serving the
> same content, which in turn might hurt your Google ranking.
>
> Also, Nutch parses "compressed" Javascript files, which are all
> written in one long line, and then somehow takes part of the code
> and adds it to the URL, creating a huge number of 404s on the
> server side.
>
> Example, you have a URL to a Javascript file like this:
>
>  http://www.somesite/javascript/foo.js
>
> Nutch parses this and then accesses random (?) new URLs which look like:
>
> http://www.somesite/javascript/someFunction();
>
> etc etc.
>
> Please, please, please fix Nutch!
>
> Thanks,
>
> Juergen
> --
> Shakodo - The road to profitable photography: http://www.shakodo.com/
>