Developing a crawler for a search engine with Python
====================================================

You may find that the threads and exceptions in Python more than
make up for anything you are missing in Perl. The Python libraries
are not as extensive, but that is mostly because they have one of
everything instead of five or six of everything.
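
As a concrete illustration of that point, here is a minimal sketch of a
threaded fetcher in which an exception on one URL does not kill the rest
of the crawl. It targets the Python 3 stdlib (the era of this thread
would have used the Queue and urllib2 modules instead); the URLs and
thread count are placeholders, not anything from the thread:

    import queue
    import threading
    import urllib.request

    urls = queue.Queue()
    for u in ("http://example.com/", "http://example.org/"):
        urls.put(u)

    results = {}

    def fetch():
        while True:
            try:
                url = urls.get_nowait()   # next URL, or stop when queue is empty
            except queue.Empty:
                return
            try:
                results[url] = urllib.request.urlopen(url, timeout=10).read()
            except OSError as exc:        # one bad URL raises; the rest still fetch
                results[url] = exc

    threads = [threading.Thread(target=fetch) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()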

Extracting links using a regular HTML parser works fine, and isn't
that much work. One of the major issues in an HTML parser is
dealing with all the illegal HTML on the web.

>>>>>>>> The concluding statement above comes from Walter and outlines the
coding advantages of using Python. A person capable of producing these
statements on the spot would know them to be true. I am unclear, therefore,
on why portions of Verity Ultraseek [a commercial product] would need to
use C or C++ modules.
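
For concreteness, here is a minimal sketch of the parser-based link
extraction Walter describes, using the standard-library parser
(html.parser in Python 3; the Python 2 of this thread's era had
HTMLParser and sgmllib). The sample markup is invented; note that the
parser copes with the unquoted attribute value:

    from html.parser import HTMLParser

    class LinkExtractor(HTMLParser):
        """Collect href values from <a> tags, tolerating sloppy markup."""

        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    parser = LinkExtractor()
    parser.feed('<p><a href="/index.html">home</a> <a href=broken.html>no quotes</a>')
    print(parser.links)   # ['/index.html', 'broken.html']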

Has anybody here written a webbot in Python?

Answer
Verity Ultraseek is a web crawler and search engine written in
Python. Portions of it are C or C++ native modules. Ultraseek
is a commercial product, so we don't give out the code. Sorry.

from: Alexander Halavais
It really depends on what you are looking for, and how tolerant of
errors you are. For most of what I do, I use the HTML parser, but I have
also done simple expression matching to pull out links. This tends to
overestimate the links (e.g., pulling out references in comments, etc.),
and often yields fragments that are not really followable, but it is at
least a possibility.
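
Here is a sketch of the expression-matching approach Alexander mentions,
and of the overestimate he warns about. The pattern is an invented,
deliberately naive one, not his actual code:

    import re

    # Naive pattern: grab anything that looks like an href value, quoted or not.
    HREF = re.compile(r'''href\s*=\s*["']?([^"'\s>]+)''', re.IGNORECASE)

    html = '''
    <a href="/real.html">a real link</a>
    <!-- <a href="/commented-out.html">inside a comment</a> -->
    <a href="#fragment">not really followable</a>
    '''

    print(HREF.findall(html))
    # ['/real.html', '/commented-out.html', '#fragment']
    # The commented-out link and the bare fragment are collected too,
    # which is exactly the overestimate described above.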

>>>>>>>> I am unclear, and have only an uninformed opinion, about the intent
of "finding" web pages whose original author did not wish them to be found.
I cannot see these pages being of any interest to the general public, or to
corporate citizenship. However, this statement by Alexander: <Q>This tends
to overestimate the links (e.g., pulling out references in comments, etc.),
and often yields fragments that are not really followable, but it is at
least a possibility.</Q> seems to indicate there is a difference between
the number of pages retrieved and the possible number of pages that could
be retrieved. This numerical difference is quite notable, and of interest
to competing coders. It is, however, a privately held number. Thanks, Alex.

Specifically, I need something like Perl's HTML::LinkExtor.

petter wrote:
Yes, in fact I found some very good examples on the website "Dive Into
Python", including how to do a link extractor. Quite simple.
http://diveintopython.org/html_processing/extracting_data.html This uses
SGMLParser, which is presumably more tolerant of illegal HTML.
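
For reference, the extractor on that page has roughly the following
shape (Python 2 only: sgmllib was removed in Python 3, where html.parser
fills the same role):

    import urllib
    from sgmllib import SGMLParser

    class URLLister(SGMLParser):
        def reset(self):                   # SGMLParser calls this from __init__
            SGMLParser.reset(self)
            self.urls = []

        def start_a(self, attrs):          # fires on every <a ...> start tag
            href = [v for k, v in attrs if k == "href"]
            self.urls.extend(href)

    usock = urllib.urlopen("http://diveintopython.org/")
    parser = URLLister()
    parser.feed(usock.read())
    parser.close()
    usock.close()
    for url in parser.urls:
        print url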


Primary concern of Petter: <Q>Still wonder how to handle logins, though...</Q>

<Me> This is really your determination to make; taking anyone else's
opinion on the matter would end in your system being less secure. I would
guess some method of encryption. I am not really clear on why a password
manager is necessary for developing a search-engine crawler; that system
should already be in place in your work environment. Maybe even a
firewall?
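
That said, the usual mechanical answer to Petter's question is to POST
the login form once and let a cookie-aware opener replay the session
cookie on every later request. Here is a minimal Python 3 stdlib sketch;
the URL, field names, and credentials are all hypothetical and depend
entirely on the target site:

    import urllib.parse
    import urllib.request
    from http.cookiejar import CookieJar

    # An opener that remembers cookies across requests.
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

    # Hypothetical login form; endpoint and field names vary per site.
    form = urllib.parse.urlencode({"username": "crawler", "password": "secret"})
    opener.open("https://example.com/login", data=form.encode("ascii"))

    # The session cookie set at login is sent automatically from here on.
    page = opener.open("https://example.com/members/").read()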

Sol
Digital Acquisition Inventory Drive Imaging


_______________________________________________
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots
