It looks to me as though Yahoo has some sort of robot defense operating.  I
was just testing a multi-threaded robot that I use to analyze discussions,
including Yahoo's stock market boards.  On the first run, it seemed to do
fine, but when I tried to run it again a few minutes later, it didn't
retrieve anything... so I tried going to the message boards using IE on the
same machine.  Every page is returning a 403 Forbidden error now -- even
when I try to see robots.txt.  As far as I know, Yahoo has never even had a
robots.txt file.
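
Here's roughly how I'm checking, as a minimal Python sketch (standard
library only; the host name below is just a stand-in for whichever
Yahoo board server you'd test, not anything from their docs):

    # Quick status check: does the server return 403 even for robots.txt?
    # The host below is a placeholder, not a documented Yahoo endpoint.
    import urllib.request
    import urllib.error

    def status(url):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.getcode()          # e.g. 200
        except urllib.error.HTTPError as e:
            return e.code                      # e.g. 403 or 404

    print(status("http://messages.yahoo.com/robots.txt"))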

I'm guessing that the speed of my robot triggered a block against this IP
address.  Another machine, in the same subnet, can access the pages just
fine.

I've been working on the underlying database for the last few weeks, so I
haven't run the spider lately.  Thus, I'm not sure when this behavior might
have started.

My robot is quite fast and my connection yields throughput of about 1
Mbit/s, so it certainly hit their server fairly hard.  But hey, it's
Yahoo.  If they can't handle getting hit this hard on a mid-day Saturday,
it's hard to imagine who can.

No lectures about well-behaved robots, please.  I know, I know.  The next
step for that robot will be to have each thread hit completely different
domains.  Perhaps each one will rotate through a few domains.
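
Roughly what I have in mind, as a Python sketch (the domain groups,
pass count, and delay are all placeholder values, not the real
spider's config):

    # Sketch: give each thread its own disjoint set of domains and rotate
    # through them round-robin, so consecutive requests from one thread
    # never hit the same host.  All names and numbers are placeholders.
    import threading
    import time
    import urllib.request

    PER_DOMAIN_DELAY = 10.0   # min seconds between hits to any one domain

    def fetch(url):
        try:
            with urllib.request.urlopen(url, timeout=15) as resp:
                return resp.read()
        except Exception as e:
            print(url, "->", e)

    def worker(domains, passes):
        for _ in range(passes):
            for host in domains:              # one hit per domain per pass
                fetch(f"http://{host}/")
            time.sleep(PER_DOMAIN_DELAY)      # throttles the per-domain rate

    groups = [["example-a.com", "example-b.com"],
              ["example-c.com", "example-d.com"]]
    threads = [threading.Thread(target=worker, args=(g, 3)) for g in groups]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

Since each thread's domain set is disjoint, even with many threads no
single site ever sees more than one request per PER_DOMAIN_DELAY from
this spider.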

Anybody know what Yahoo might be doing, or what its policy is about robots?
I haven't been able to find anything that addresses the issue directly.  I
don't see anything under its TOS that would clearly apply.  If they want to
have a limit on robots, I sure would appreciate it if they would say what it
is...

It's been about 30 minutes now, and it seems I'm still blocked.

Just checked from another machine -- they still have no robots.txt at all.

Nick

--
[EMAIL PROTECTED]
(408) 904-7198

