Re: robot developers?

2001-02-12 Thread Tom Thomson

The white paper contains the very strange statement

[quote]
Finally, a question that exposes the worst flaw of the robots.txt protocol:
a webmaster wishes to make all pages of a Web site, EXCEPT the home page
(i.e. /), accessible to robots; how can she do this using the robots.txt
protocol? The answer - She can't.
[unquote]

This is nonsense. What's happening here is that the Webmaster doesn't
understand websites and has failed to distinguish between the default page
and the home page.  If she wants to allow all pages except the home page, she
can write a Disallow line that explicitly excludes the home page using its
full path.  She can then have the default page be a client-side redirection
to the home page, so that users other than robots get the impression
that the home page is the default page.
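Tom's workaround can be sketched with Python's standard urllib.robotparser. The path /home.html and the example.com host are hypothetical (the post names no actual paths); the idea is that the home page is disallowed by full path while "/" stays fetchable:

```python
# Sketch of the workaround described above, assuming the real home page
# lives at the hypothetical path /home.html and "/" is merely a
# client-side redirect to it.
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Disallow: /home.html",   # the real home page, excluded by full path
]
rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "http://example.com/home.html"))   # False: home page blocked
print(rp.can_fetch("*", "http://example.com/"))            # True: default page allowed
print(rp.can_fetch("*", "http://example.com/other.html"))  # True: everything else allowed
```

The default page at / would then carry a client-side redirect (e.g. a meta refresh) to /home.html, so human visitors land on the home page while compliant robots never request it.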

No fault with the robots.txt protocol here.

Tom Thomson

 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Alan Perkins
 Sent: 09 February 2001 00:58
 To: [EMAIL PROTECTED]
 Subject: Re: robot developers?


 Here's something for discussion: The robots.txt and robots meta tag
 protocols have serious flaws but there appears to be no concerted
 effort to
 fix them.

 We've published a white paper discussing the flaws in detail at

 www.ebrandmanagement.com/whitepapers/

 Anyone on this list is welcome to read the white paper and start a
 discussion about it.  You will need the following user name and password:

 User name: [EMAIL PROTECTED]
 Password:  [EMAIL PROTECTED]

 Enjoy...

 Alan Perkins, Chief Technology Officer
 e-Brand Management Limited





Re: robot developers?

2001-02-12 Thread Alan Perkins

Tom

 [quote]
 Finally, a question that exposes the worst flaw of the robots.txt protocol:
 a webmaster wishes to make all pages of a Web site, EXCEPT the home page
 (i.e. /), accessible to robots; how can she do this using the robots.txt
 protocol? The answer - She can't.
 [unquote]

 This is nonsense, what's happening here is that the Webmaster doesn't
 understand websites and has failed to distinguish between the default page
 and the home page.  If she wants to allow all pages except the homepage,
 she can write a disallow line that explicitly excludes the home page using
 it's full path.  She can then have the default page be a client-side
 redirection to the home page, so that users other than robots will get the
 impression that the home page is the default page.

The '(i.e. /)' in the question was supposed to define the term home page
more precisely - it's what you call the default page.  So, to put the
question in your terms:

A webmaster wishes to make all pages of a Web site, EXCEPT the default page
(i.e. /), accessible to robots; how can she do this using the robots.txt
protocol?
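The sticking point can be demonstrated with Python's urllib.robotparser (a minimal illustration using a hypothetical example.com site): under the 1994 protocol a Disallow value is a path prefix, so the only rule that matches "/" is "Disallow: /", and that prefix matches every other URL on the site as well.

```python
from urllib.robotparser import RobotFileParser

# The only Disallow prefix that matches the default page "/" ...
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])

# ... necessarily matches every other path too, so there is no way
# to block "/" alone under the 1994 protocol.
print(rp.can_fetch("*", "http://example.com/"))            # False
print(rp.can_fetch("*", "http://example.com/other.html"))  # False
```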

With regard to your proposed solution, is the onus on Webmasters to
understand robots, or on robots to understand Webmasters?  In light of
recent legal cases, I think it may be the latter...

There are problems with robots.txt and the robots meta tag.  We all know it.
I think there *could* be one standard that addressed the following issues:
link rot, crawling, caching/duplication, indexing and the law.  That
standard would be a worthy replacement to the current standards.

Regards

Alan Perkins, e-Brand Management Limited
http://www.ebrandmanagement.com/
White Paper in question: http://www.ebrandmanagement.com/whitepapers/




Re: robot developers?

2001-02-09 Thread Corey Wineman

  On the robots side, I'd be interested to know what techniques
 people are using to store URLs in the queue for later processing.
 (i.e., since folks want a delay between requests, it makes sense
 to have multiple input queues or use a database of some sort
 to store the URLs until they are needed for processing.)  Of course,
 the idea is to put them in a queue but to evenly distribute the output
 of the queue over the hosts being crawled so that the requests
 do not center on any one given host.
 -Art

I was wondering how other people are doing this too. Are most people
distributing the queue across different database tables, as Art says? Are
people creating their own file structures? Using a queue in memory?
Right now, I am using a single database table. It's not working that great,
especially as the size of that table grows.  What types of databases are
people using? I am using ODBC, and can switch between different databases.
Even in the best cases, when my URL queue grows to around 100,000, inserting
new URLs becomes painfully slow, and I need to purge.
To distribute among hosts, I grab the next URL for processing by doing a
round-robin over a set of root URLs. It doesn't yet guarantee that
consecutive connections won't be made to the same host, but I am working
on it.
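One way to sketch the per-host round-robin that Art and Corey describe (entirely hypothetical; neither post includes code): keep a separate FIFO per host and rotate hosts on every fetch, so consecutive requests only hit the same host when no other host has work queued.

```python
import collections
from urllib.parse import urlparse

class HostRoundRobinQueue:
    """One FIFO per host, served round-robin, so consecutive
    fetches hit different hosts whenever possible."""

    def __init__(self):
        # host -> deque of pending URLs; OrderedDict tracks serving order
        self._queues = collections.OrderedDict()

    def put(self, url):
        host = urlparse(url).netloc
        self._queues.setdefault(host, collections.deque()).append(url)

    def get(self):
        # Take from the least-recently-served host that has work,
        # then rotate that host to the back of the serving order.
        for host in list(self._queues):
            pending = self._queues[host]
            if pending:
                self._queues.move_to_end(host)
                return pending.popleft()
        return None  # frontier is empty

q = HostRoundRobinQueue()
for u in ["http://a.com/1", "http://a.com/2",
          "http://b.com/1", "http://b.com/2"]:
    q.put(u)
print([q.get() for _ in range(4)])
# Hosts alternate: a.com, b.com, a.com, b.com
```

A per-host politeness delay then drops in naturally: stamp each host with its last-fetch time and skip hosts whose delay hasn't elapsed. The same host-keyed layout also maps onto per-host database tables if the frontier outgrows memory.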

-Corey




Re: robot developers?

2001-02-08 Thread (Lorraine Patsco)

 are people out there developing (design, programming) robots? 

I'm having one designed for my site according to my wishes.




Re: robot developers?

2001-02-08 Thread fastmedia

 I think it has to do with the fact that every time someone engineers a
 stroke of genius, they keep it to themselves. Everyone wants free handouts,
 but few are willing to share their own findings.

is anybody interested in being able to grab urls out of flash movies?

i have some perl code i wrote that can skip over a movie's external linked
elements (like loaded layers) to provide a list of outbound hrefs (like
skip intro buttons)

i'd be happy to share it if someone might suggest how to better organise the
code.
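for comparison with chris's Perl (which presumably walks the real SWF tag structure), a much cruder sketch is to decompress the movie if needed and scan the raw bytes for URL-looking strings. It will produce false positives and misses anything not stored as a plain string, but it is self-contained. The one file-format fact it relies on: an SWF signature is "FWS" (uncompressed) or "CWS" (zlib-compressed body after the 8-byte header).

```python
import re
import zlib

def swf_urls(data: bytes) -> list[str]:
    """Naive URL scan over an SWF movie: inflate a 'CWS'
    (zlib-compressed) file, then grep the bytes for http(s) URLs."""
    if data[:3] == b"CWS":
        # 8-byte header (signature, version, file length), then zlib body
        data = data[:8] + zlib.decompress(data[8:])
    elif data[:3] != b"FWS":
        raise ValueError("not an SWF file")
    # Any run of printable ASCII starting with http:// or https://
    found = re.findall(rb"https?://[\x21-\x7e]+", data)
    return sorted({m.decode("ascii") for m in found})

# Synthetic example: a fake uncompressed movie containing one link
fake = b"FWS\x05" + (0).to_bytes(4, "little") + b"\x00http://example.com/intro\x00"
print(swf_urls(fake))  # ['http://example.com/intro']
```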



--
chris paul
fastmedia.com




Re: robot developers?

2001-02-08 Thread ap296

 So, where is the robots brainstorming taking place?

If you find the answer to your question, please let us all know here. I'm
also disappointed at the lack of content on this list.

I think it has to do with the fact that every time someone engineers a stroke
of genius, they keep it to themselves. Everyone wants free handouts, but
few are willing to share their own findings.