Petter Karlström wrote:

Hello all,

Nice to see that this list woke up again! :^)

And now the list owner finally woke up, too... I hadn't noticed the recent traffic on the list until just now. Are those messages about an address no longer in use going to the whole list? Aghh. I've taken care of that, I hope, but the source address wasn't actually subscribed, so I had to guess.

Back to the point at hand... I've written several specialized robots in
Python over the last few years.  They are specifically for crawling
on-line discussions and parsing out individual messages and meta-data.

Look for Aahz's examples (do a Google search on Aahz and Python, I'm
sure that'll lead you there).  He makes multi-threading for your spider
pretty easy and adaptable to various kinds of processing.
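
For readers who want a starting point, here is a minimal sketch of a multi-threaded spider built around a shared work queue. It is not Aahz's code, only an illustration of the general pattern, written for Python 3. The names `crawl`, `fetch_links`, `num_workers` and `max_pages` are assumptions made up for this example; `fetch_links` stands for whatever function downloads a page and returns the URLs found on it.

```python
import queue
import threading


def crawl(seeds, fetch_links, num_workers=4, max_pages=100):
    """Visit pages with a pool of worker threads and return the URLs seen.

    fetch_links(url) is a caller-supplied (hypothetical) function that
    downloads one page and returns an iterable of URLs found on it.
    """
    todo = queue.Queue()
    seen = set()
    lock = threading.Lock()

    def enqueue(url):
        # Only queue a URL the first time it is seen, up to max_pages.
        with lock:
            if url in seen or len(seen) >= max_pages:
                return
            seen.add(url)
        todo.put(url)

    def worker():
        while True:
            url = todo.get()
            if url is None:  # sentinel: no more work
                break
            try:
                for link in fetch_links(url):
                    enqueue(link)
            except Exception:
                pass  # a real robot would log the failure and maybe retry
            finally:
                todo.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for url in seeds:
        enqueue(url)
    todo.join()  # wait until every queued page has been processed
    for _ in threads:
        todo.put(None)  # tell each worker to exit
    for t in threads:
        t.join()
    return seen
```

Because the per-page processing lives entirely in `fetch_links`, the same loop can be reused for different kinds of processing by passing in a different function.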

I have written crawlers in Perl before, but I wish to try out Python for
a hobby project. Has anybody here written a webbot in Python?

Python is of course a smaller language, so the libraries aren't as
extensive as the Perl counterparts. Also, I find the documentation
somewhat lacking (or it could be me being new to the language).

After switching from Perl to Python a couple of years ago, I haven't ever found the Python libraries lacking, although I expected to. Documentation, in the form of published books, has been a bit scarce, but new ones have been coming out lately. I just looked through one on text applications in Python, but haven't bought it yet. It definitely looked good.

Are there any small examples available on use of HTMLParser and htmllib?
Specifically, I need something like the linkextor available in Perl.

One trick is to search on "import [modulename]" as a phrase. That'll often uncover code you can use as an example. What does linkextor do? Link extractor? If so, I just use regular expressions.
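
As a rough illustration of both approaches, here is a small link-extractor sketch. It assumes Python 3, where the `HTMLParser` module lives at `html.parser` and `htmllib` no longer exists. The names `LinkExtractor`, `extract_links`, `extract_links_regex` and `HREF_RE` are made up for this example, and the regular expression is a deliberately simple one that only handles quoted `href` values.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, resolved against base_url."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url=""):
    """Parser-based extraction."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Simple pattern: an <a ...> tag with a quoted href attribute.
HREF_RE = re.compile(r'<a\s[^>]*?href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def extract_links_regex(html, base_url=""):
    """Regex-based extraction; quick, but easier to fool than a parser."""
    return [urljoin(base_url, href) for href in HREF_RE.findall(html)]
```

Usage would look like `extract_links(page_text, "http://example.com/dir/page.html")`, which returns a list of absolute URLs. The regex version is fine for well-behaved pages; the parser version copes better with odd attribute order and unquoted values.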

Also, what is the neatest way to store session data like login and password? PassWordMgr?

Store in what sense?


I'll take a look at my code and see if I can share something generic.
Since we're doing www.opensector.org, I suppose it would only be right
for us to share at least *some* of our code!

However... I just looked at what I have and the older stuff doesn't
really add much to Aahz's examples, other than some simple use of MySQL
as the store; my newer stuff is far too specific to the task I'm doing
to be able to quickly "sanitize" it.

The main thing I did to address our specific needs was to create a Python
class for message pages in specific types of web-based discussion
forums.  That's partly to extract URLs, but mostly to extract other
features and to intelligently (in the sense of being able to update my
database rapidly, re-visiting the minimum number of pages) navigate the
threading structures, which work in various ways.  The class for
Jive-based forums is only 225 lines, as an example.  The multi-threaded
module that uses it is 100 lines; a single-threaded version is 25 lines.

We also have a Python robot for NNTP servers, which obviously doesn't
need recursion.  It's about 400 lines.  A lot of it deals with things
like missing messages, zeroing in on desired date ranges, avoiding
downloading huge messages, recovery from failure, etc.

All of these talk to MySQL...

Nick

--
Nick Arnett
Phone/fax: (408) 904-7198
[EMAIL PROTECTED]

_______________________________________________
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots
