Re: Nutch Parser annoyingly faulty
Hi Jurgen,

> Since I wrote this email - which I thought got ignored by the Nutch
> developers -

Thanks for reporting the problem, Jurgen, and sorry that you felt you were being ignored. The few active developers Nutch has contribute during their spare time; the reason why you did not get any comments on this is that no one had an instant answer or time to investigate in more detail. You definitely raised an important issue which is worth investigating.

To answer your first email: the JavaScript parser is notoriously noisy and generates all sorts of monstrosities. It used to be activated by default, but this won't be the case as of the forthcoming 1.3 release.

I have not been able to reproduce the issue with the dot, though. I put this HTML:

    <html><a href=".">This Page</a></html>

on our server at http://www.digitalpebble.com/dummy.html, ran:

    ./nutch org.apache.nutch.parse.ParserChecker http://www.digitalpebble.com/dummy.html

and got:

    Outlinks: 1
      outlink: toUrl: http://www.digitalpebble.com/ anchor: This Page

as expected. Any particular URL on your site that you had this problem with?

> I am getting bombed on my server by 2 especially annoying and
> unresponsive companies which use Nutch. The companies (and Nutch) are
> both blocked by my robots.txt file, see:
> http://www.shakodo.com/robots.txt
> but while they both access this file a couple of times per day, they
> ignore it completely. The company http://www.lijit.com/ called me an
> idiot when I complained about their faulty configuration, and the
> other company http://www.comodo.com/ ignored every complaint.

By default, Nutch does respect robots.txt, and the community as a whole encourages server politeness and reasonable use. However, we can't prevent people from using ridiculous settings (e.g. a high number of threads per host, a low time gap between calls) or from modifying the code to bypass the robots checking (see my comment below).

> Can you please check if my robots.txt file has the correct syntax and
> whether I reject Nutch in general correctly, or can you please help me
> fix the syntax so that Nutch-powered crawlers don't access our
> server(s) anymore?

I have checked your robots.txt and it looks correct. I tried parsing http://www.shakodo.com with the user agents you specified; Nutch fully respected robots.txt and the content was not fetched.

> If the syntax in fact is correct, then I must assume that at least
> these 2 companies altered the source to actively abuse the robots.txt
> rules.

That's indeed a possibility.

> Doesn't this violate your license?

Not as far as I know. The Apache license allows people to modify the code; most people do that for positive reasons, and unfortunately we can't prevent people from bypassing the robots check.

> Help is appreciated!

Another option is to see if the companies you want to block constantly use the same IP range, and configure your servers so that they prevent access to those IPs. You could also file a complaint with the company hosting the crawl; I know that Amazon are pretty reactive with EC2 and would take measures to make sure their users do the right thing.

Thanks

Julien

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
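The kind of robots.txt check Julien describes can also be reproduced locally with Python's standard-library `urllib.robotparser`. This is a minimal sketch: the rules below are illustrative stand-ins, not a copy of shakodo.com's actual file, and `SomeBot/1.0` is a made-up user agent.

```python
# Minimal sketch: verify that robots.txt rules block a given crawler,
# using only the Python standard library. The rules are illustrative
# stand-ins, not a copy of shakodo.com's actual robots.txt.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Nutch
Disallow: /

User-agent: Lijit
Disallow: /

User-agent: *
Disallow: /badrobot/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# robotparser matches the token before the first "/" of the user-agent
# string against the User-agent groups above, so "Lijit Crawler/Nutch-0.9"
# is caught by the "Lijit" group.
print(parser.can_fetch("Lijit Crawler/Nutch-0.9", "http://www.shakodo.com/"))   # False
print(parser.can_fetch("SomeBot/1.0", "http://www.shakodo.com/badrobot/trap"))  # False
print(parser.can_fetch("SomeBot/1.0", "http://www.shakodo.com/assignments/"))   # True
```

An unmodified Nutch performs an equivalent check before fetching, so a crawler that still requests blocked paths is ignoring the file rather than misreading it.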
Re: Nutch Parser annoyingly faulty
Hi Julien,

On 3/4/11 7:09 PM, Julien Nioche wrote:
> Thanks for reporting the problem, Jurgen, and sorry that you felt you
> were being ignored. The few active developers Nutch has contribute
> during their spare time; the reason why you did not get any comments
> on this is that no one had an instant answer or time to investigate in
> more detail. You definitely raised an important issue which is worth
> investigating.

Thanks for taking the time to reply and checking my settings!

> To answer your first email: the JavaScript parser is notoriously noisy
> and generates all sorts of monstrosities. It used to be activated by
> default, but this won't be the case as of the forthcoming 1.3 release.

I see. "Monstrosities" describes it quite well :)

> I have not been able to reproduce the issue with the dot, though.
> Any particular URL on your site that you had this problem with?

No, it's not on particular URLs, but all over the place. However, I just checked and it seems to happen with Nutch 0.9 and 1.0. Here is an example:

    216.24.131.152 - - [26/Feb/2011:00:53:44 +0900] "GET /assignments/tags/advertisement/. HTTP/1.0" 404 820 "-" "Lijit Crawler/Nutch-0.9 (Reports crawler; http://www.lijit.com/robot/crawler; info(a)lijit(d)com)"
    216.24.131.152 - - [26/Feb/2011:00:55:03 +0900] "GET /assignments/tags/assignments_design/. HTTP/1.0" 404 820 "-" "Lijit Crawler/Nutch-0.9 (Reports crawler; http://www.lijit.com/robot/crawler; info(a)lijit(d)com)"
    216.24.131.152 - - [26/Feb/2011:00:55:56 +0900] "GET /assignments/tags/assignments_commercial-photography/. HTTP/1.0" 404 820 "-" "Lijit Crawler/Nutch-0.9 (Reports crawler; http://www.lijit.com/robot/crawler; info(a)lijit(d)com)"
    216.24.131.152 - - [26/Feb/2011:00:56:19 +0900] "GET /assignments/tags/apartment_rental/. HTTP/1.0" 404 820 "-" "Lijit Crawler/Nutch-0.9 (Reports crawler; http://www.lijit.com/robot/crawler; info(a)lijit(d)com)"
    216.24.131.152 - - [26/Feb/2011:00:57:09 +0900] "GET /assignments/tags/assignments_church/. HTTP/1.0" 404 820 "-" "Lijit Crawler/Nutch-0.9 (Reports crawler; http://www.lijit.com/robot/crawler; info(a)lijit(d)com)"
    216.24.131.152 - - [26/Feb/2011:00:57:26 +0900] "GET /assignments/tags/assignments_corporate/. HTTP/1.0" 404 820 "-" "Lijit Crawler/Nutch-0.9 (Reports crawler; http://www.lijit.com/robot/crawler; info(a)lijit(d)com)"
    216.24.131.152 - - [26/Feb/2011:00:57:44 +0900] "GET /assignments/tags/assignments_cd-cover/. HTTP/1.0" 404 820 "-" "Lijit Crawler/Nutch-0.9 (Reports crawler; http://www.lijit.com/robot/crawler; info(a)lijit(d)com)"
    216.24.131.152 - - [26/Feb/2011:00:58:16 +0900] "GET /assignments/tags/amateur_assignments/. HTTP/1.0" 404 820 "-" "Lijit Crawler/Nutch-0.9 (Reports crawler; http://www.lijit.com/robot/crawler; info(a)lijit(d)com)"
    216.24.131.152 - - [26/Feb/2011:00:58:18 +0900] "GET /assignments/tags/assignments_event/. HTTP/1.0" 404 820 "-" "Lijit Crawler/Nutch-0.9 (Reports crawler; http://www.lijit.com/robot/crawler; info(a)lijit(d)com)"
    216.24.131.152 - - [26/Feb/2011:00:59:16 +0900] "GET /assignments/tags/agent/. HTTP/1.0" 404 820 "-" "Lijit Crawler/Nutch-0.9 (Reports crawler; http://www.lijit.com/robot/crawler; info(a)lijit(d)com)"

> By default, Nutch does respect robots.txt, and the community as a
> whole encourages server politeness and reasonable use. However, we
> can't prevent people from using ridiculous settings (e.g. a high
> number of threads per host, a low time gap between calls) or from
> modifying the code to bypass the robots checking (see my comment
> below).

Understood.

> I have checked your robots.txt and it looks correct. I tried parsing
> http://www.shakodo.com with the user agents you specified; Nutch fully
> respected robots.txt and the content was not fetched.

Thanks a lot for the confirmation!

> That's indeed a possibility.

And now also confirmed. I might add another "Disallow: /badrobot/" trap to my robots.txt to see if I get more violations.

>> Doesn't this violate your license?
> Not as far as I know. The Apache license allows people to modify the
> code; most people do that for positive reasons, and unfortunately we
> can't prevent people from bypassing the robots check.

Too bad, but then again you can use a hammer to put a nail into the wall (useful) or into somebody's head (not so useful, with exceptions).

> Another option is to see if the companies you want to block constantly
> use the same IP range, and configure your servers so that they prevent
> access to those IPs. You could also file a complaint with the company
> hosting the crawl; I know that Amazon are pretty reactive with EC2 and
> would take measures to make sure their users do the right thing.

They are already blocked on most of the IPs I could find, and I also reported them to their ISPs, but they seem to have better arguments (i.e. they pay their ISPs) than I have.

Anyway, thanks a lot for checking and coming back to me with this information, very much appreciated! I will not add Nutch 1.3 to my disallow rule set! :)

Thanks,
Juergen

--
Shakodo - The road to profitable photography:
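The server-side blocking Juergen describes could look roughly like the following in Apache httpd 2.2-style configuration. This is a hedged sketch only: the 216.24.131.0/24 range is inferred from the single address in the logs above, and the User-Agent pattern is illustrative.

```apache
# Hypothetical httpd.conf fragment (Apache 2.2 syntax).
# Flag requests whose User-Agent matches the offending crawlers.
SetEnvIfNoCase User-Agent "lijit|nutch" bad_bot

<Location />
    Order Allow,Deny
    Allow from all
    # Deny by user-agent flag and by the observed source range.
    Deny from env=bad_bot
    Deny from 216.24.131.0/24
</Location>
```

User-agent matching is trivially defeated by a crawler that changes its identification string, so the IP-range rule is the more robust of the two; both need updating if the crawler moves hosts.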
Build failed in Jenkins: Nutch-trunk #1416
See https://hudson.apache.org/hudson/job/Nutch-trunk/1416/

--
[...truncated 1009 lines...]
A  src/plugin/subcollection/src/java/org/apache/nutch/collection
A  src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java
A  src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java
A  src/plugin/subcollection/src/java/org/apache/nutch/collection/package.html
A  src/plugin/subcollection/src/java/org/apache/nutch/indexer
A  src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection
A  src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java
A  src/plugin/subcollection/README.txt
A  src/plugin/subcollection/plugin.xml
A  src/plugin/subcollection/build.xml
A  src/plugin/index-more
A  src/plugin/index-more/ivy.xml
A  src/plugin/index-more/src
A  src/plugin/index-more/src/test
A  src/plugin/index-more/src/test/org
A  src/plugin/index-more/src/test/org/apache
A  src/plugin/index-more/src/test/org/apache/nutch
A  src/plugin/index-more/src/test/org/apache/nutch/indexer
A  src/plugin/index-more/src/test/org/apache/nutch/indexer/more
A  src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
A  src/plugin/index-more/src/java
A  src/plugin/index-more/src/java/org
A  src/plugin/index-more/src/java/org/apache
A  src/plugin/index-more/src/java/org/apache/nutch
A  src/plugin/index-more/src/java/org/apache/nutch/indexer
A  src/plugin/index-more/src/java/org/apache/nutch/indexer/more
A  src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
A  src/plugin/index-more/src/java/org/apache/nutch/indexer/more/package.html
A  src/plugin/index-more/plugin.xml
A  src/plugin/index-more/build.xml
AU src/plugin/plugin.dtd
A  src/plugin/parse-ext
A  src/plugin/parse-ext/ivy.xml
A  src/plugin/parse-ext/src
A  src/plugin/parse-ext/src/test
A  src/plugin/parse-ext/src/test/org
A  src/plugin/parse-ext/src/test/org/apache
A  src/plugin/parse-ext/src/test/org/apache/nutch
A  src/plugin/parse-ext/src/test/org/apache/nutch/parse
A  src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext
A  src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext/TestExtParser.java
A  src/plugin/parse-ext/src/java
A  src/plugin/parse-ext/src/java/org
A  src/plugin/parse-ext/src/java/org/apache
A  src/plugin/parse-ext/src/java/org/apache/nutch
A  src/plugin/parse-ext/src/java/org/apache/nutch/parse
A  src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext
A  src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext/ExtParser.java
A  src/plugin/parse-ext/plugin.xml
A  src/plugin/parse-ext/build.xml
A  src/plugin/parse-ext/command
A  src/plugin/urlnormalizer-pass
A  src/plugin/urlnormalizer-pass/ivy.xml
A  src/plugin/urlnormalizer-pass/src
A  src/plugin/urlnormalizer-pass/src/test
A  src/plugin/urlnormalizer-pass/src/test/org
A  src/plugin/urlnormalizer-pass/src/test/org/apache
A  src/plugin/urlnormalizer-pass/src/test/org/apache/nutch
A  src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net
A  src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer
A  src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass
AU src/plugin/urlnormalizer-pass/src/test/org/apache/nutch/net/urlnormalizer/pass/TestPassURLNormalizer.java
A  src/plugin/urlnormalizer-pass/src/java
A  src/plugin/urlnormalizer-pass/src/java/org
A  src/plugin/urlnormalizer-pass/src/java/org/apache
A  src/plugin/urlnormalizer-pass/src/java/org/apache/nutch
A  src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net
A  src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer
A  src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass
AU src/plugin/urlnormalizer-pass/src/java/org/apache/nutch/net/urlnormalizer/pass/PassURLNormalizer.java
AU src/plugin/urlnormalizer-pass/plugin.xml
AU src/plugin/urlnormalizer-pass/build.xml
A  src/plugin/parse-html
A  src/plugin/parse-html/ivy.xml
A  src/plugin/parse-html/lib
A  src/plugin/parse-html/lib/tagsoup.LICENSE.txt
A  src/plugin/parse-html/src
A  src/plugin/parse-html/src/test
A  src/plugin/parse-html/src/test/org
A  src/plugin/parse-html/src/test/org/apache
A  src/plugin/parse-html/src/test/org/apache/nutch
A  src/plugin/parse-html/src/test/org/apache/nutch/parse
A