Thanks Scott!
Since I wrote this email - which I thought had been ignored by
the Nutch developers - my server has been getting hammered by 2
especially annoying and unresponsive companies which use Nutch.
Both companies (and Nutch in general) are blocked by my
robots.txt file, see:
http://www.shakodo.com/robots.txt
but while they both access this file a couple of times
per day, they ignore it completely.
The company http://www.lijit.com/ called me an "idiot" for
complaining about their faulty configuration, and the other
company, http://www.comodo.com/, ignored every complaint.
Can you please check whether my robots.txt file has the correct
syntax and whether I reject Nutch in general correctly, or else
help me fix the syntax so that Nutch-powered crawlers no longer
access our server(s)? If the syntax is in fact correct, then I
must assume that at least these 2 companies altered the source
to actively circumvent the robots.txt rules.
Doesn't this violate your license?
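For reference, my understanding of the robots exclusion standard
is that a blanket rejection should be as simple as this (assuming
Nutch-based crawlers really identify themselves with a user-agent
containing "Nutch"):

User-agent: Nutch
Disallow: /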
Help is appreciated!
Juergen
--
Shakodo - The road to profitable photography: http://www.shakodo.com/
On 3/4/11 10:40 AM, Scott Gonyea wrote:
Has anyone looked into this? This is especially a problem when
folks like Juergen are customers and, quite rightfully, raise
hell. I wasn't aware of this, since Nutch is a software metaphor
for a firehose. But what I have noticed is that the URL parser is
really, really terrible. Expletive-worthy.
The problem I am experiencing is the lack of subdomain support.
Dumping thousands of regexes into a flat file is a terrible hack.
More than that, pushing metadata down through a given site
becomes unreliable: if one site links to another, and that site's
links are crawled, your metadata is now polluted.
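For anyone who hasn't seen it: the filter in question is a flat
file of prefix-signed Java regexes, roughly like this abbreviated
sketch in the style of conf/regex-urlfilter.txt (the patterns
here are mine, not the shipped defaults):

# "+" includes, "-" excludes; the first matching rule wins
-\.(gif|jpg|png|css|js)$
-[?*!@=]
+^http://([a-z0-9-]*\.)*example\.com/
+.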
Etc. I don't want to come across as whiny, but I just did. I
really think Nutch needs to hunker down on tests. I'm guilty of
not caring about them myself, but that's because testing Java is
pretty painful compared to BDD tools like RSpec:
http://www.codecommit.com/blog/java/the-brilliance-of-bdd
Scott
On Fri, Feb 25, 2011 at 4:08 PM, Juergen Specht
<[email protected]> wrote:
Hi Nutch Team,
before I permanently reject Nutch from all my sites, I had
better tell you why: your URL parser is extremely faulty and
creates a lot of trouble.
Here is an example: if you have a page at, say:
http://www.somesite/somepage/
and a link in its HTML looks like:
<a href=".">This Page</a>
then the parser should identify that the "." (dot) refers
to this URL:
http://www.somesite/somepage/
and not to:
http://www.somesite/somepage/.
Every single browser does this correctly; why doesn't Nutch?
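To show what correct resolution looks like, here is a minimal
Java sketch (the class name is mine) using java.net.URI, which
follows the same RFC 3986 rules that browsers apply:

import java.net.URI;

public class DotResolution {
    public static void main(String[] args) {
        URI base = URI.create("http://www.somesite/somepage/");
        // Per RFC 3986, "." resolves to the base directory itself:
        System.out.println(base.resolve("."));  // http://www.somesite/somepage/
        // A parser that naively appends the token gets it wrong:
        System.out.println(base + ".");         // http://www.somesite/somepage/.
    }
}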
Why is this important? Many new sites no longer use the
traditional mapping of URLs to directories, but instead encode
controllers, actions, parameters, etc. in the URL. These get
split by a separator, which is often "/" (slash), so a URL with
a trailing dot requests a different resource than one without
it. Ignoring the dot in the backend to cope with Nutch's faulty
parser would create at least 2 URLs serving the same content,
which in turn might hurt your Google ranking.
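To make that concrete, a toy router that splits the path on "/"
sees an extra segment when the dot is left in (a hypothetical
sketch, not any particular framework):

import java.util.Arrays;

public class RouteSplit {
    public static void main(String[] args) {
        // Without the stray dot: [, somepage]
        System.out.println(Arrays.toString("/somepage/".split("/")));
        // With it: [, somepage, .] -- an extra segment, i.e. a
        // different resource for the router to dispatch on
        System.out.println(Arrays.toString("/somepage/.".split("/")));
    }
}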
Also, Nutch parses "compressed" JavaScript files, which are
written in one long line, then somehow takes parts of the code
and appends them to the URL, creating a huge number of 404s on
the server side.
For example, say you have a URL to a JavaScript file like this:
http://www.somesite/javascript/foo.js
Nutch parses this and then accesses random (?) new URLs which look like:
http://www.somesite/javascript/someFunction();
And so on.
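My guess at the failure mode (a hypothetical sketch, not Nutch's
actual code): an over-eager token pattern is run over the
minified JavaScript, and every match is resolved as if it were a
relative link:

import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NaiveJsLinks {
    // Hypothetical over-eager "link" pattern: any path-like token
    private static final Pattern TOKEN = Pattern.compile("[\\w.();]+");

    public static void main(String[] args) {
        URI base = URI.create("http://www.somesite/javascript/foo.js");
        String minifiedJs = "function f(){someFunction();}";
        Matcher m = TOKEN.matcher(minifiedJs);
        while (m.find()) {
            // Each code fragment becomes a "relative URL", yielding
            // requests like .../javascript/someFunction();
            System.out.println(base.resolve(m.group()));
        }
    }
}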
Please, please, please fix Nutch!
Thanks,
Juergen
--
Shakodo - The road to profitable photography: http://www.shakodo.com/