You were right about the timing. Sorry about that.
Please ignore that part of the email. However, we are getting about 500 errors per day between the two bots that are accessing our server.

As for "&section" being replaced with "§ion": in the file
... nutch\branches\branch-0.8\src\java\org\apache\nutch\html\Entities.java
there is an area adding special characters [add("&sect", 167);]. However, I believe that those special characters are supposed to start with & and end in ; (e.g., &sect; or &nbsp;). I have not recompiled the code yet, but I believe that this should remedy the problem.

Please keep me informed of your progress.
Once again, sorry about the timing thing.

Thanks.

-----Original Message-----
From: Dmitri Loguinov [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, July 26, 2006 7:09 PM
To: [EMAIL PROTECTED]
Cc: Hsin-Tsang Lee
Subject: Re: Nutch Problems (0.8-dev)

Dear Fred,

Thanks for your feedback. We'll look into this issue and fix it shortly. Are these URLs generated by javascript inside your pages?

As for the frequency of hits, do you mind providing specific times and pages loaded by IRLbot? Our default rate-limiting is 40 seconds per website. If you host many virtual sites on a single IP, the server may receive more hits (each directed at a different virtual site, though).

Cheers,
Dmitri

----- Original Message -----
From: Fred Tyre
To: [EMAIL PROTECTED] ; [EMAIL PROTECTED] ; nutch-agent@lucene.apache.org
Cc: Jon
Sent: Wednesday, July 26, 2006 5:29 PM
Subject: Nutch Problems (0.8-dev)

Our web server has been receiving a lot of failing traffic from shopping.com and irl.cs.tamu.edu.

I believe your crawler is seeing "&section" and replacing it with "§ion":

http://www.businessair.com/avdealers.cfm?alpha_choice=ALL§ion=AC&first_sort_by_column=DEALRSTATE&sort_by_columns=DEALRSTATEDESC,DEALRNAME,CITY,DESCRIPTION

The URL should be...

http://www.businessair.com/avdealers.cfm?alpha_choice=ALL&section=AC&first_sort_by_column=DEALRSTATE&sort_by_columns=DEALRSTATEDESC,DEALRNAME,CITY,DESCRIPTION

This would only be a minor problem, except that your bot is sending several requests while only waiting a second between requests. A typical user can only click on the page a few seconds after the request has been fulfilled. Therefore, a request should only be made every 15-20 seconds at the most. It doesn't look like your bot even waited for the page to finish loading. Otherwise, a system admin could see the above actions as a Denial Of Service attack.

As for "&section" being replaced with "§ion": in the file
... nutch\html\Entities.java
there is an area adding special characters. However, I believe that those special characters are supposed to start with & and end in ; (e.g., &sect; or &nbsp;). I have not recompiled the code yet, but I believe that this should remedy the problem.

Please keep me informed of your progress, or I will be forced to block your bots (which I would prefer not to do).

Thanks.

Sincerely,
Fred

><><><><><><><><><><><><><><><><><><
Fred Tyre
Information Services
Heartland Communications, Inc.
515-574-2147
[EMAIL PROTECTED]
><><><><><><><><><><><><><><><><><><
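
For anyone following the thread, here is a rough sketch of the behavior described above. It is not the Nutch code: the class EntityDecodeSketch, its three-entry table, and both method names are made up for illustration. It only assumes that a decoder which accepts an entity name without the closing ; will see the "&sect" at the start of "&section" and turn it into "§", while a decoder that insists on the closing ; leaves the query string alone.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityDecodeSketch {
    // Hypothetical, deliberately tiny entity table (a real one lists hundreds).
    private static final Map<String, String> ENTITIES = Map.of(
            "sect", "\u00A7",   // section sign
            "nbsp", "\u00A0",   // non-breaking space
            "amp", "&");

    // Lenient: a known entity name matches even when no ';' follows it.
    private static final Pattern LENIENT = Pattern.compile("&(sect|nbsp|amp);?");
    // Strict: an entity is only recognized when it ends in ';'.
    private static final Pattern STRICT = Pattern.compile("&([A-Za-z]+);");

    static String decodeLenient(String text) {
        return decode(text, LENIENT);
    }

    static String decodeStrict(String text) {
        return decode(text, STRICT);
    }

    private static String decode(String text, Pattern pattern) {
        Matcher m = pattern.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String replacement = ENTITIES.get(m.group(1));
            if (replacement == null) {
                replacement = m.group();   // unknown name: keep the text as-is
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String query = "avdealers.cfm?alpha_choice=ALL&section=AC&first_sort_by_column=DEALRSTATE";
        // Lenient decoding corrupts the parameter name: ...ALL§ion=AC&first_sort_by_column=...
        System.out.println(decodeLenient(query));
        // Strict decoding leaves the query string unchanged.
        System.out.println(decodeStrict(query));
    }
}
```

With the strict variant, "&sect;" is still decoded to "§", so properly terminated entities keep working; only the unterminated match inside "&section" goes away.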