Re: [Robots] Is this mailing list alive?
On Nov 3, 2003, at 11:16 PM, Nick Arnett wrote:

> [EMAIL PROTECTED] wrote:
>> I've created a robot, www.dead-links.com, and I wonder if this list is alive.
> It is alive, but very, very quiet.

Yeah, this robots thing is just a fad, it'll never catch on.

-Tim

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots
[Robots] Re: better language for writing a Spider ?
Sean M. Burke wrote:
> In short, if people want to see improvements to LWP, email me and say what you want done.

For robots, you need a call that says "fetch this URL, but get a maximum of XX bytes and spend a maximum of YY seconds doing it." The return status should tell you whether it finished or timed out, and how many bytes were actually retrieved.

BTW, have the LWP timeouts been fixed? As recently as early 2000, they were known to generally not work.

-Tim

--
This message was sent by the Internet robots and spiders discussion list ([EMAIL PROTECTED]). For list server commands, send "help" in the body of a message to [EMAIL PROTECTED].
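[The fetch-with-caps call Tim is asking for can be sketched independently of LWP. Below is a minimal Python illustration; the function name `read_limited` and its return shape are invented for this sketch, not part of any library. It reads a response stream in chunks, stops at either a byte cap or a deadline, and reports which outcome occurred:]

```python
import io
import time

def read_limited(stream, max_bytes, max_seconds, chunk_size=4096):
    """Read from a file-like stream, stopping after max_bytes bytes
    or max_seconds seconds, whichever comes first.

    Returns (data, finished, timed_out): `finished` is True only if
    the stream ended (EOF) before either limit was hit.
    """
    deadline = time.monotonic() + max_seconds
    chunks = []
    total = 0
    while total < max_bytes:
        if time.monotonic() > deadline:
            return b"".join(chunks), False, True   # ran out of time
        chunk = stream.read(min(chunk_size, max_bytes - total))
        if not chunk:                              # EOF: fetch completed
            return b"".join(chunks), True, False
        chunks.append(chunk)
        total += len(chunk)
    return b"".join(chunks), False, False          # hit the byte cap

# Example: a 10-byte "page" read with an 8-byte cap is truncated.
data, finished, timed_out = read_limited(io.BytesIO(b"0123456789"), 8, 5.0)
```

[For what it's worth, later versions of LWP::UserAgent do grow `timeout` and `max_size` settings that cover part of this wish, though the caller still has to inspect the response to learn whether it was truncated.]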
[Robots] Re: matching and UserAgent: in robots.txt
Sean M. Burke wrote:
> I'm a bit perplexed over whether the current Perl library WWW::RobotRules implements a certain part of the Robots Exclusion Standard correctly. So forgive me if this seems a simple question, but my reading of the Robots Exclusion Standard hasn't really cleared it up in my mind yet.

Is this the REP stuff out of LWP? My opinion, based on having used it in a BG robot and not getting flamed, is that the LWP implementation of Robot Exclusion is as close to 100% right as you're going to get.

-Tim
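[As a point of comparison to WWW::RobotRules, Python's standard library ships its own REP implementation in `urllib.robotparser`. This sketch (the robots.txt content is invented for illustration) shows the User-agent matching behavior in question: a robot whose name matches a specific section uses only that section's rules, while everyone else falls through to the `*` catch-all:]

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: one section for a specific robot, plus a
# catch-all section that bars everyone else from the whole site.
ROBOTS_TXT = """\
User-agent: FancyBot
Disallow: /private/

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.modified()  # mark the rules as fetched so can_fetch() will answer
rp.parse(ROBOTS_TXT.splitlines())

# FancyBot matches its own section, so only /private/ is off limits;
# any other agent hits the catch-all and is barred everywhere.
print(rp.can_fetch("FancyBot", "http://example.com/public/page.html"))   # True
print(rp.can_fetch("FancyBot", "http://example.com/private/page.html"))  # False
print(rp.can_fetch("OtherBot", "http://example.com/public/page.html"))   # False
```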
[Robots] Re: better language for writing a Spider ?
At 09:47 AM 14/03/02 -0800, srinivas mohan wrote:
> Now as the performance is low, we wanted to redevelop our spider in a language like C or Perl and use it with our existing product. I will be thankful if anyone can help me choose the better language, where I can get better performance.

You'll never get better performance until you understand why you had lousy performance before. It's not obvious to me why Java should get in the way.

I've written two very large robots and used perl both times. There were two good reasons to choose perl:

- A robot fetches pages, analyzes them, and manages a database of been-processed and to-process. The fetching involves no CPU. The database is probably the same in whatever language you use. Thus the leftover computation is picking apart pages looking for URLs and BASE values and so on, and perl is hard to beat for that type of code.

- Time-to-market was critical. Using perl means you have to write much less code than in Java or C or whatever, so you get done quicker.

It's not clear that you can write a robot to run faster than a well-done perl one. It is clear you can write one that's much more maintainable; perl makes it too easy to write obfuscated code. Another disadvantage of perl is the large memory footprint: since a robot needs to be highly parallel, you probably can't afford to have a perl process per execution thread.

Next time I might go with python. Its regexp engine isn't quite as fast, but the maintainability is better.

-Tim
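[The "picking apart pages looking for URLs and BASE values" step Tim describes is compact in a high-level language. Here is a minimal Python sketch (the class name `LinkExtractor` and the sample HTML are invented for illustration) that collects link targets and honours a `<base href>` when resolving relative URLs:]

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute link targets from a page, honouring <base href>."""
    def __init__(self, page_url):
        super().__init__()
        self.base = page_url          # a <base href> tag overrides this
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and attrs.get("href"):
            self.base = attrs["href"]
        elif tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base, attrs["href"]))

HTML = """<html><head><base href="http://example.com/docs/"></head>
<body><a href="page.html">one</a> <a href="/top.html">two</a></body></html>"""

extractor = LinkExtractor("http://example.com/")
extractor.feed(HTML)
print(extractor.links)
# ['http://example.com/docs/page.html', 'http://example.com/top.html']
```

[A real robot would also look at `src` attributes, handle malformed markup, and feed the results into its to-process queue, but the core of the "leftover computation" is about this size.]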
[Robots] Re: better language for writing a Spider ?
At 10:36 AM 14/03/02 -0800, Nick Arnett wrote:
> I wish I could be more specific, but I never did figure out what was really going on. Following an LWP request through the debugger is a long and convoluted journey...

I totally agree with Nick that when LWP works, it's OK, but when it doesn't, debugging is beyond the scope of mere mortals. And it just doesn't do timeouts or input throttling. I tried to get it to do timeouts, it didn't work, and when I went and found the appropriate discussion group, half the messages were from people having trouble with timeouts... mind you, that was early 2000; maybe things have improved?

-Tim