You were right about the timing.

Sorry about that.

Please ignore that part of the email.

However, we are getting about 500 errors per day between the two bots that
are accessing our server.

As far as the "&section" being replaced with "§ion"...

Under the file ...
   nutch\branches\branch-0.8\src\java\org\apache\nutch\html\Entities.java

  there is an area adding special characters [add("&sect",   167);].
However, I believe that those special characters are supposed to start
with & and end in ; (i.e., &sect; or &nbsp;). I have not recompiled the
code yet, but I believe that this should remedy the problem.
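
The rule proposed above (decode a named entity only when it starts with & and ends in ;) can be sketched in Java as follows. This is a minimal illustration, not the code in Entities.java: the class name StrictEntityDecoder, the decode method, and the three-entry table are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: decodes "&name;" only when the
// trailing semicolon is present.
final class StrictEntityDecoder {
    // Illustrative table only; a real table would hold many more entities.
    private static final Map<String, Integer> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("sect", 167);
        ENTITIES.put("nbsp", 160);
        ENTITIES.put("amp", 38);
    }

    // The semicolon is part of the pattern, so "&section=AC" never matches.
    private static final Pattern ENTITY =
            Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    static String decode(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous match and this one.
            out.append(input, last, m.start());
            Integer code = ENTITIES.get(m.group(1));
            if (code != null) {
                out.append((char) code.intValue());
            } else {
                // Unknown entity name: leave the text untouched.
                out.append(m.group());
            }
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("avdealers.cfm?alpha_choice=ALL&section=AC"));
        System.out.println(decode("&sect; 12"));
    }
}
```

With this rule, "ALL&section=AC" passes through unchanged because "&section" has no trailing ;, while "&sect;" still decodes to the section sign.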

Please keep me informed of your progress.

Once again, sorry about the timing thing.

Thanks.
  -----Original Message-----
  From: Dmitri Loguinov [mailto:[EMAIL PROTECTED]
  Sent: Wednesday, July 26, 2006 7:09 PM
  To: [EMAIL PROTECTED]
  Cc: Hsin-Tsang Lee
  Subject: Re: Nutch Problems (0.8-dev)


  Dear Fred,

  Thanks for your feedback. We'll look into this issue and fix it shortly.
Are these URLs generated by JavaScript inside your pages? As for the
frequency of hits, would you mind providing specific times and pages loaded by
IRLbot? Our default rate-limiting is 40 seconds per website. If you host
many virtual sites on a single IP, the server may receive more hits (each
directed at a different virtual site though).
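
To illustrate why the key used for rate limiting matters, here is a minimal sketch in Java. It is not IRLbot's or Nutch's scheduler: the class PerKeyDelay, its reserve method, and the example host names and address are hypothetical. The 40-second delay is the figure quoted above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical politeness tracker: enforces a fixed delay between fetches
// that share the same key (a host name or an IP address).
final class PerKeyDelay {
    private final long delayMillis;
    private final Map<String, Long> nextAllowed = new HashMap<>();

    PerKeyDelay(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    // Returns how many milliseconds the caller must wait before fetching,
    // and reserves the following slot for this key.
    long reserve(String key, long nowMillis) {
        long earliest = Math.max(nowMillis, nextAllowed.getOrDefault(key, 0L));
        nextAllowed.put(key, earliest + delayMillis);
        return earliest - nowMillis;
    }

    public static void main(String[] args) {
        // Keyed by host name: two virtual sites are fetched back to back,
        // even if both resolve to the same server.
        PerKeyDelay byHost = new PerKeyDelay(40_000);
        System.out.println(byHost.reserve("a.example", 0));
        System.out.println(byHost.reserve("b.example", 0));

        // Keyed by IP address: the second fetch waits the full delay.
        PerKeyDelay byIp = new PerKeyDelay(40_000);
        System.out.println(byIp.reserve("192.0.2.1", 0));
        System.out.println(byIp.reserve("192.0.2.1", 0));
    }
}
```

Keying the delay by host name lets one physical server receive several hits inside a single 40-second window when it hosts several virtual sites; keying by IP address spaces them out.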

  Cheers,
  Dmitri
    ----- Original Message -----
    From: Fred Tyre
    To: [EMAIL PROTECTED] ; [EMAIL PROTECTED] ;
nutch-agent@lucene.apache.org
    Cc: Jon
    Sent: Wednesday, July 26, 2006 5:29 PM
    Subject: Nutch Problems (0.8-dev)



    Our web server has been receiving a lot of failing traffic from
    shopping.com and irl.cs.tamu.edu.

    I believe your crawler is seeing "&section" and replacing it with "§ion".

    http://www.businessair.com/avdealers.cfm?alpha_choice=ALL§ion=AC&first_sort_by_column=DEALRSTATE&sort_by_columns=DEALRSTATEDESC,DEALRNAME,CITY,DESCRIPTION

    The URL should be...

    http://www.businessair.com/avdealers.cfm?alpha_choice=ALL&section=AC&first_sort_by_column=DEALRSTATE&sort_by_columns=DEALRSTATEDESC,DEALRNAME,CITY,DESCRIPTION

    This would only be a minor problem, except that your bot is sending
    several requests while waiting only a second between requests.

    A typical user can only click on the page a few seconds after the
    request has been fulfilled.  Therefore, a request should be made at most
    once every 15-20 seconds.

    It doesn't look like your bot even waited for the page to finish
    loading.

    Otherwise, a system admin could see the above actions as a Denial of
    Service attack.

    As far as the "&section" being replaced with "§ion"...

    Under the file ...
       nutch\html\Entities.java

      there is an area adding special characters.  However, I believe that
    those special characters are supposed to start with & and end in ;
    (i.e., &sect; or &nbsp;).  I have not recompiled the code yet, but I
    believe that this should remedy the problem.

    Please keep me informed of your progress, or I will be forced to block
    your bots (which I would prefer not to do).

    Thanks.

    Sincerely,
    Fred

    ><><><><><><><><><><><><><><><><><><
       Fred Tyre
       Information Services
       Heartland Communications, Inc.
       515-574-2147
       [EMAIL PROTECTED]
    ><><><><><><><><><><><><><><><><><><


