Re: robot developers?
The white paper contains the very strange statement:

[quote] Finally, a question that exposes the worst flaw of the robots.txt protocol: a webmaster wishes to make all pages of a Web site, EXCEPT the home page (i.e. /), accessible to robots; how can she do this using the robots.txt protocol? The answer - She can't. [unquote]

This is nonsense. What's happening here is that the webmaster doesn't understand websites and has failed to distinguish between the default page and the home page. If she wants to allow all pages except the home page, she can write a Disallow line that explicitly excludes the home page using its full path. She can then make the default page a client-side redirection to the home page, so that users other than robots will get the impression that the home page is the default page. No fault with the robots.txt protocol here.

Tom Thomson

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Alan Perkins
Sent: 09 February 2001 00:58
To: [EMAIL PROTECTED]
Subject: Re: robot developers?

Here's something for discussion: the robots.txt and robots meta tag protocols have serious flaws, but there appears to be no concerted effort to fix them. We've published a white paper discussing the flaws in detail at www.ebrandmanagement.com/whitepapers/

Anyone on this list is welcome to read the white paper and start a discussion about it. You will need the following user name and password:

User name: [EMAIL PROTECTED]
Password: [EMAIL PROTECTED]

Enjoy...

Alan Perkins, Chief Technology Officer
e-Brand Management Limited
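Tom's suggested workaround can be sketched concretely. The sketch assumes the server's default document is served under an explicit path such as /index.html (the actual filename depends on the server configuration); a robots.txt record can then exclude that file by its full path:

```
# robots.txt -- Tom's workaround, assuming the default document
# is also reachable at the explicit path /index.html
User-agent: *
Disallow: /index.html
```

The default page itself can then be a client-side redirect (for example a meta refresh tag) pointing at the page the webmaster considers the real home page, so human visitors never notice the indirection. Note that Disallow values are path prefixes, so this line also blocks any URL beginning with /index.html, and it does not block a robot that requests the bare URL "/".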
Re: robot developers?
Tom,

[quote] Finally, a question that exposes the worst flaw of the robots.txt protocol: a webmaster wishes to make all pages of a Web site, EXCEPT the home page (i.e. /), accessible to robots; how can she do this using the robots.txt protocol? The answer - She can't. [unquote]

[quote] This is nonsense. What's happening here is that the webmaster doesn't understand websites and has failed to distinguish between the default page and the home page. If she wants to allow all pages except the home page, she can write a Disallow line that explicitly excludes the home page using its full path. She can then make the default page a client-side redirection to the home page, so that users other than robots will get the impression that the home page is the default page. [unquote]

The '(i.e. /)' in the question was supposed to define the term "home page" more precisely: it's what you call the default page. So, to put the question in your terms: a webmaster wishes to make all pages of a Web site, EXCEPT the default page (i.e. /), accessible to robots; how can she do this using the robots.txt protocol?

With regard to your proposed solution: is the onus on webmasters to understand robots, or on robots to understand webmasters? In the light of recent legal cases, I think it may be the latter...

There are problems with robots.txt and the robots meta tag. We all know it. I think there *could* be one standard that addressed the following issues: link rot, crawling, caching/duplication, indexing, and the law. That standard would be a worthy replacement for the current standards.

Regards,
Alan Perkins, e-Brand Management Limited
http://www.ebrandmanagement.com/

White Paper in question: http://www.ebrandmanagement.com/whitepapers/
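The underlying limitation being argued about here: in the original robots.txt protocol, Disallow values match by path prefix and there is no Allow directive, so there is no record that matches the root URL "/" without also matching every other URL on the site:

```
# The only record that blocks the bare URL "/" ...
User-agent: *
Disallow: /
# ... but "/" is a prefix of EVERY path, so this blocks the whole site.
```

(Later, non-standard extensions such as an Allow directive, honoured by some crawlers, make the "everything except /" case expressible, but they are not part of the original protocol this thread discusses.)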
Re: robot developers?
On the robots side, I'd be interested to know what techniques people are using to store URLs in the queue for later processing. (I.e., since folks want a delay between requests, it makes sense to have multiple input queues or use a database of some sort to store the URLs until they are needed for processing.) Of course, the idea is to put them in a queue but to evenly distribute the output of the queue over the hosts being crawled, so that the requests do not center on any one given host.

-Art

I was wondering how other people are doing this too. Are most people distributing the queue across different database tables, as Art says? Are people creating their own file structures? Using a queue in memory?

Right now, I am using a single database table. It's not working that great, especially as the size of that table grows. What types of databases are people using? I am using ODBC, and can switch between different databases. Even in the best cases, when my URL queue grows to around 100,000, inserting new URLs becomes painfully slow, and I need to purge.

To distribute among hosts, I grab the next URL for processing by doing a round-robin over a set of root URLs. It doesn't yet guarantee that consecutive connections to the same host won't be made, but I am working on it.

-Corey
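The multiple-input-queues idea Art describes can be sketched as an in-memory frontier: one queue per host, with the hosts themselves held in a round-robin ring and a per-host politeness delay. This is a minimal sketch (names like Frontier are mine, not from the thread), not a replacement for a database-backed queue once the frontier no longer fits in memory:

```python
import time
from collections import defaultdict, deque
from urllib.parse import urlsplit

class Frontier:
    """Crawl frontier that round-robins across hosts, so consecutive
    fetches avoid hammering any single host, and enforces a per-host
    politeness delay between requests."""

    def __init__(self, delay=1.0):
        self.delay = delay                 # seconds between hits to one host
        self.queues = defaultdict(deque)   # host -> pending URLs
        self.hosts = deque()               # round-robin ring of active hosts
        self.last_fetch = {}               # host -> time of last fetch

    def add(self, url):
        host = urlsplit(url).netloc
        if not self.queues[host]:          # host was idle: put it in the ring
            self.hosts.append(host)
        self.queues[host].append(url)

    def next_url(self, now=None):
        """Return the next URL whose host's delay has elapsed, else None."""
        now = time.monotonic() if now is None else now
        for _ in range(len(self.hosts)):
            host = self.hosts.popleft()
            if now - self.last_fetch.get(host, 0.0) >= self.delay:
                url = self.queues[host].popleft()
                self.last_fetch[host] = now
                if self.queues[host]:      # still has work: back of the ring
                    self.hosts.append(host)
                return url
            self.hosts.append(host)        # delay not yet elapsed; skip host
        return None
```

Rotating the ring rather than scanning a single table is what keeps "next URL" cheap as the frontier grows; a database-backed version would key the queue table on host for the same effect.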
Re: robot developers?
Are people out there developing (designing, programming) robots? I'm having one designed for my site according to my wishes.
Re: robot developers?
I think it has to do with the fact that every time someone engineers a stroke of genius, they keep it to themselves. Everyone wants free handouts, but few are willing to share their own findings.

Is anybody interested in being able to grab URLs out of Flash movies? I have some Perl code I wrote that can skip over a movie's externally linked elements (like loaded layers) to provide a list of outbound hrefs (like "skip intro" buttons). I'd be happy to share it if someone might suggest how to better organise the code.

-- chris paul
fastmedia.com
Re: robot developers?
So, where is the robots brainstorming taking place?

If you find the answer to your question, please let us all know here. I'm also disappointed at the lack of content on this list. I think it has to do with the fact that every time someone engineers a stroke of genius, they keep it to themselves. Everyone wants free handouts, but few are willing to share their own findings.