Developing a crawler for a search engine with Python
====================================================
Walter writes:

> You may find that the threads and exceptions in Python more than make up
> for anything you are missing in Perl. The Python libraries are not as
> extensive, but that is mostly because they have one of everything instead
> of five or six of everything. Extracting links using a regular HTML parser
> works fine, and isn't that much work. One of the major issues in an HTML
> parser is dealing with all the illegal HTML on the web.

The statement above comes from Walter and outlines the coding advantages of
using Python. Someone in a position to make these statements firsthand would
know them to be true. I am unclear, therefore, why portions of Verity
UltraSeek (a commercial product) would need to use C or C++ modules. Has
anybody here written a webbot in Python?

Answer:

> Verity Ultraseek is a web crawler and search engine written in Python.
> Portions of it are C or C++ native modules. Ultraseek is a commercial
> product, so we don't give out the code. Sorry.

From Alexander Halavais:

> It really depends on what you are looking for, and how tolerant of errors
> you are. For most of what I do, I use the HTML parser, but I have also done
> simple expression matching to pull out links. This tends to overestimate
> the links (e.g., pulling out references in comments, etc.), and often
> yields fragments that are not really followable, but it is at least a
> possibility.

I am unclear about, and hold only an uninformed opinion on, the intent of
"finding" web pages that the original author did not wish to be found. I
cannot see how such pages could be of any interest to the general public or
to corporate citizens.
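To make the comparison concrete, here is a minimal sketch of the two approaches Alexander describes, using only the Python standard library. The sample HTML (including the deliberately illegal unquoted attribute) is invented for illustration; the era's SGMLParser is assumed to behave much like today's similarly tolerant html.parser.

```python
# Two ways to pull links out of (possibly illegal) HTML.
import re
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href attributes from <a> start tags; comments are ignored."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Invented sample page: one normal link, one link inside a comment,
# and one link with an unquoted (illegal) attribute value.
page = """<html><body>
<a href="http://example.com/page1">one</a>
<!-- <a href="http://example.com/hidden">commented out</a> -->
<a href=http://example.com/page2>unquoted, illegal HTML</a>
</body></html>"""

parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # the parser tolerates the unquoted href but skips the comment

# Simple expression matching, as Alexander describes: less work, but it
# also pulls the reference out of the comment, overestimating the links.
regex_links = re.findall(r'href="?([^"\s>]+)', page)
print(regex_links)
```

Running this shows the difference in counts: the parser returns two followable links, while the regular expression returns three, including the commented-out one.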
However, this statement by Alexander: <Q>This tends to overestimate the links (e.g., pulling out references in comments, etc.), and often yields fragments that are not really followable, but it is at least a possibility.</Q> seems to indicate there is a difference between the number of pages retrieved and the number of pages that could possibly be retrieved. That numerical difference is quite notable, and of interest to competing coders. It is, however, a privately held number. Thanks, Alex.

Specifically, I need something like the linkextor available in Perl.

Petter wrote: Yes, in fact I found some very good examples on the website "Dive Into Python", including how to do a linkextor. Quite simple.
http://diveintopython.org/html_processing/extracting_data.html
This uses SGMLParser, which presumably is more tolerant of illegal HTML.

Petter's primary concern: <Q>Still wonder how to handle logins, though...</Q>

<Me> This is really your determination to make; taking anyone else's word on the matter would leave your system less secure. I would guess some method of encryption. I am not really clear on why a password manager is necessary for the development of a search-engine crawler; that system should already be in place in your work environment, perhaps even behind a firewall.

Sol
Digital Acquisition Inventory / Drive Imaging

_______________________________________________
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots
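As a footnote to Petter's login question: one stdlib way a crawler can handle HTTP Basic authentication is urllib's password manager plus an auth handler. This is only a sketch of that mechanism; the URL and credentials below are placeholders, not anything from the thread, and sites using cookie- or form-based logins would need a different approach.

```python
# Sketch: an opener that answers HTTP Basic auth challenges with stored
# credentials, using only the standard library.
import urllib.request

def make_authenticated_opener(url, user, password):
    """Build an opener that retries requests to `url` with credentials
    after receiving a 401 challenge."""
    mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    mgr.add_password(None, url, user, password)     # None = any realm
    handler = urllib.request.HTTPBasicAuthHandler(mgr)
    return urllib.request.build_opener(handler)

# Placeholder URL and credentials for illustration only.
opener = make_authenticated_opener(
    "http://example.com/private/", "crawler", "secret")
# opener.open("http://example.com/private/index.html") would now
# authenticate automatically when the server challenges the request.
```

The fetch itself is left commented out so the sketch runs without network access; in a real crawler the opener would simply replace direct `urlopen` calls for the protected site.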