I'm standing firm on my suggestions.  Adding a delay for crawlers is a
good idea in concept, and allowing fractional seconds is a way for
webmasters to request reasonable constraints.  Is it such a stretch to
give a robot that you use to promote your business unmitigated access
to your site while requiring other robots to throttle down to a few
pages per second?
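
For illustration only -- and assuming fractional values were accepted,
which Yahoo's current Slurp documentation does not promise, and with
"PartnerBot" as a made-up name -- the kind of robots.txt I have in mind
would look something like this:

    # The robot we use to promote our business: full, undelayed access.
    User-agent: PartnerBot
    Disallow:

    # Everyone else: please slow down to a couple of pages per second.
    User-agent: *
    Crawl-delay: 0.5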

As for preferred scanning windows, many organizations have a huge surge
of traffic from customers during their normal operating hours, but are
relatively calm otherwise.  Asking robots to scan only outside of peak
hours is a nice compromise between keeping them out entirely and
keeping them out when you're too busy serving pages to human readers.
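
As far as I know, no crawler honors anything like this today, so this
is purely a sketch; the directive name below is hypothetical and not
part of any spec:

    User-agent: *
    # Hypothetical: please crawl only between 01:00 and 05:00
    # server local time, outside our peak hours.
    Visit-time: 0100-0500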

I just read Walter's response to this thread, and he mentions
bytes-per-day and pages-per-day limits.  Those are fine in the abstract
and may be helpful.  But if a robot is limited to 100MB a day and it
decides to take it all in one burst during your peak traffic hours,
then volume limits alone are not sufficient.
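
To be useful, a volume cap would have to be paired with a rate limit or
a scanning window.  Again, just a sketch: apart from Yahoo's
Crawl-delay, the directive names here are made up and not part of any
published spec:

    User-agent: *
    Crawl-delay: 1
    # Hypothetical additions:
    Visit-time: 0100-0500
    Volume-limit: 100MB/day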

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
On Behalf Of [EMAIL PROTECTED]
Sent: Saturday, March 13, 2004 4:31 AM
To: Internet robots, spiders, web-walkers, etc.
Subject: RE: [Robots] Yahoo evolving robots.txt, finally



--- Matthew Meadows <[EMAIL PROTECTED]> wrote:
> I agree with Walter.

So do I, partially. :)

> There's a lot of variables that should have been considered for this
> new value.  If nothing else the specification should have called for
> the time in milliseconds, or otherwise allow for fractional seconds.

I disagree that this level of granularity is needed.  See my earlier email.

>  In addition, it seems a bit presumptuous for Yahoo
> to think that they can force a de facto standard just by implementing 
> it first.

That's how things work in real life.  Think web browsers 10 years ago
and various Netscape, then IE extensions.  Now lots of them are
considered standard.

> With this line of thinking webmasters would eventually be required to 
> update their robots.txt file for dozens of individual bots.

In theory, yes.  In reality, I agree with Walter: this extension will
prove to be as useless as "<blink>" and will therefore not be supported
by any big crawlers.

> It's hard enough to get them to do it now for the general case, this 
> additional fragmentation is not going to make anybody's job easier. Is
> Google going to implement their own extensions, then MSN, AltaVista,
> and AllTheWeb?

Not likely.  In order for them to remain competitive, they have to keep
fetching web pages at high rates.  robots.txt only limits them.  I can't
think of an extension to robots.txt that would let them do a better job.
Actually, I can. :)

> Finally, if we're going to start specifying the criteria for 
> scheduling, let's consider some other alternatives, like preferred 
> scanning windows.

Same as with crawl-delay: everyone would want crawlers to visit their
sites at night, which would saturate crawlers' networks, so search
engines won't push that extension.  (Actually, big crawlers run from
multiple points around the planet, so maybe my statement is flawed.)

Otis

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> On Behalf Of Walter Underwood
> Sent: Friday, March 12, 2004 3:37 PM
> To: Internet robots, spiders, web-walkers, etc.
> Subject: Re: [Robots] Yahoo evolving robots.txt, finally
> 
> 
> --On Friday, March 12, 2004 6:46 AM -0800 [EMAIL PROTECTED]
> wrote:
> >
> > I am surprised that after all that talk about adding new semantic
> > elements to robots.txt several years ago, nobody commented that the
> > new Yahoo crawler (former Inktomi crawler) took a brave step in that
> > direction by adding "Crawl-delay:" syntax.
> > 
> > http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html
> > 
> > Time to update your robots.txt parsers!
> 
> No, time to tell Yahoo to go back and do a better job.
> 
> Does crawl-delay allow decimals? Negative numbers? Could this spec be
> a bit better quality? The words "positive integer" would improve
> things a lot.
> 
> Sigh. It would have been nice if they'd discussed this on the list
> first. "crawl-delay" is a pretty dumb idea. Any value over one second
> means it takes forever to index a site. Ultraseek has had a "spider
> throttle" option to add this sort of delay, but it is almost never
> used, because Ultraseek reads 25 pages from one site, then moves to
> another. There are many kinds of rate control.
> 
> wunder
> --
> Walter Underwood
> Principal Architect
> Verity Ultraseek
> 

_______________________________________________
Robots mailing list
[EMAIL PROTECTED] http://www.mccmedia.com/mailman/listinfo/robots