Re: [Robots] Is this mailing list alive?

2003-11-04 Thread Tim Bray
On Nov 3, 2003, at 11:16 PM, Nick Arnett wrote:

[EMAIL PROTECTED] wrote:

I've created a robot, www.dead-links.com, and I wonder if this list is
alive.
It is alive, but very, very quiet.
Yeah, this robots thing is just a fad, it'll never catch on. -Tim

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


[Robots] Re: better language for writing a Spider ?

2002-03-15 Thread Tim Bray


Sean M. Burke wrote:

 In short, if people want to see improvements to LWP, email me and say what 
 you want done.


For robots, you need a call that says "fetch this URL, but get a maximum
of XX bytes and spend a maximum of YY seconds doing it."  The return
status should tell you whether it finished or timed out, and how many
bytes were actually retrieved.

BTW, have the LWP timeouts been fixed?  As recently as early 2000, they
were generally known not to work.  -Tim
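For what it's worth, current LWP::UserAgent does expose roughly this interface through its `timeout` and `max_size` settings, with a `Client-Aborted` response header flagging truncation. A minimal sketch (the URL and limits are illustrative, not from the original message):

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new(
    timeout  => 30,        # spend at most ~30 seconds waiting on the socket
    max_size => 100_000,   # stop reading the body after ~100 KB
);

my $res = $ua->get('http://example.com/');

if ($res->header('Client-Aborted')) {
    # LWP sets Client-Aborted to "max_size" when it truncated the body
    printf "truncated after %d bytes\n", length $res->content;
}
elsif ($res->is_success) {
    printf "fetched %d bytes\n", length $res->content;
}
else {
    print "failed: ", $res->status_line, "\n";
}
```

Note that `timeout` bounds each low-level read, not total wall-clock time for the whole request, which is part of what this thread is complaining about.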


--
This message was sent by the Internet robots and spiders discussion list 
([EMAIL PROTECTED]).  For list server commands, send help in the body of a message 
to [EMAIL PROTECTED].



[Robots] Re: matching and UserAgent: in robots.txt

2002-03-14 Thread Tim Bray


Sean M. Burke wrote:

 I'm a bit perplexed over whether the current Perl library WWW::RobotRules 
 implements a certain part of the Robots Exclusion Standard correctly.  So 
 forgive me if this seems a simple question, but my reading of the Robots 
 Exclusion Standard hasn't really cleared it up in my mind yet.


Is this the REP stuff out of LWP?  My opinion, based on having used it
in a BG robot and not getting flamed, is that the LWP
implementation of Robot Exclusion is as close to 100% right as you're
going to get. -Tim
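For readers landing here from a search: the module under discussion is used roughly like this. A minimal sketch, with a hypothetical robots.txt and URLs for illustration:

```perl
use strict;
use warnings;
use WWW::RobotRules;

# Hypothetical robots.txt content; in a real robot you would fetch
# http://host/robots.txt yourself and hand the body to parse().
my $robots_txt = <<'EOT';
User-agent: *
Disallow: /private/
EOT

my $rules = WWW::RobotRules->new('MyBot/1.0');
$rules->parse('http://example.com/robots.txt', $robots_txt);

# allowed() checks a URL against the rules parsed for that host
print $rules->allowed('http://example.com/index.html') ? "allowed\n" : "blocked\n";
print $rules->allowed('http://example.com/private/x')  ? "allowed\n" : "blocked\n";
```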





[Robots] Re: better language for writing a Spider ?

2002-03-14 Thread Tim Bray


At 09:47 AM 14/03/02 -0800, srinivas mohan wrote:
Now, as the performance is low, we wanted to redevelop
our spider in a language like C or Perl, and use
it with our existing product.

I will be thankful if anyone can help me choose
a better language, where I can get better
performance.

You'll never get better performance until you understand why you
had lousy performance before.  It's not obvious to me why Java
should get in the way.

I've written two very large robots and used Perl both times.
There were two good reasons to choose Perl:

- A robot fetches pages, analyzes them, and manages a database
  of been-processed and to-process.  The fetching involves no CPU.
  The database is probably the same in whatever language you use.
  Thus the leftover computation is picking apart pages looking
  for URLs and BASE values and so on... Perl is hard to beat
  for that type of code.
- Time-to-market was critical.  Using Perl means you have to write
  much less code than in Java or C or whatever, so you get done
  quicker.
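The "picking apart pages looking for URLs and BASE values" step can be sketched as below. This is a quick regex pass in the spirit of the message, with a hypothetical page for input; a production robot would want a real HTML parser such as HTML::Parser:

```perl
use strict;
use warnings;

# Hypothetical page content for illustration
my $html = <<'EOT';
<html><head><base href="http://example.com/docs/"></head>
<body><a href="a.html">A</a> <a href="http://other.org/b">B</a></body></html>
EOT

# Pull out the BASE value, then every href on an anchor tag
my ($base) = $html =~ /<base\s+href\s*=\s*["']([^"']+)["']/i;
my @links  = $html =~ /<a\s+[^>]*href\s*=\s*["']([^"']+)["']/gi;

print "base: $base\n";
print "link: $_\n" for @links;
```

This is exactly the kind of text-mangling inner loop where Perl's regex engine earns its keep.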

It's not clear that you can write a robot that runs faster than a
well-done Perl one.  It is clear you can write one that's much
more maintainable; Perl makes it too easy to write obfuscated code.
Another disadvantage of Perl is the large memory footprint: since
a robot needs to be highly parallel, you probably can't afford to
have a Perl process per execution thread.

Next time I might go with Python.  Its regexp engine isn't quite
as fast, but the maintainability is better.  -Tim





[Robots] Re: better language for writing a Spider ?

2002-03-14 Thread Tim Bray


At 10:36 AM 14/03/02 -0800, Nick Arnett wrote:

  I wish
I could be more specific, but I never did figure out what was really going
on.  Following an LWP request through the debugger is a long and convoluted
journey...

I totally agree with Nick that when LWP works, it's OK, but when
it doesn't, debugging is beyond the scope of mere mortals.  And
it just doesn't do timeouts or input throttling.  I tried to
get it to do timeouts and it didn't; I went and found the appropriate
discussion group, and half the messages were about trouble with
timeouts... mind you, that was early 2000, maybe things have
improved? -Tim

