Re: [Robots] robot in python?

2003-11-19 Thread Petter Karlström
SsolSsinclair wrote:

Walter:
You may find that the threads and exceptions in Python more than
make up for anything you are missing in Perl. The Python libraries
are not as extensive, but that is mostly because they have one of
everything instead of five or six of everything.
SsolSsinclair:
This concluding statement [above] comes from Walter and outlines the
coding advantages of using Python. A person capable of inventing these
statements on the spot would know them to be true. I am unclear,
therefore, why portions of Verity Ultraseek [a commercial product] would
need to use C or C++ modules.

Well, Walter compared Python and Perl, not Python and C or C++. I can see
why portions of a bot would be written in C or C++. Performance issues
would perhaps not be too wild a guess.



I am unclear, and have an uninformed opinion, about the intent of
"finding" web pages that the original author did not wish to be found.
I cannot perceive any possibility of these pages being of interest to
the general public or to the corporate citizenship. However, this
statement by Walter: "This tends to overestimate the links (e.g.,
pulling out references in comments, etc.), and often yields fragments
that are not really followable, but it is at least a possibility."
seems to indicate there is a difference between the number of pages
retrieved and the "possible number of pages that could be retrieved".
This numerical difference is QUITE notable, and of interest to
competing coders. It is, however, a privately held number. Thanks Alex.
Sorry, but what you're discussing is quite a different matter from the
discussion you're quoting. Overestimation by, for example, regexps just
means that the bot may mistakenly store some things as tags that really
aren't. Overestimating what you may need in more general terms is a
different (albeit interesting) matter.
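
A small sketch of the overestimation discussed here, assuming Python 3
and only the standard `re` and `html.parser` modules. The function names
and the sample page are made up for illustration; this is not code from
anyone on the list.

```python
import re
from html.parser import HTMLParser

# Naive pattern: any href="..." anywhere in the text, markup or not.
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)


def regex_links(html):
    """Return every href value the regular expression can find."""
    return HREF_RE.findall(html)


class AnchorCollector(HTMLParser):
    """Collect href values from real <a> start tags only."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def parser_links(html):
    """Return href values seen by the HTML parser."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical page: one live link and one link inside a comment.
SAMPLE = (
    '<a href="/live.html">live</a>\n'
    '<!-- <a href="/old.html">retired</a> -->\n'
)
```

On this sample the regular expression also reports the commented-out
link, while the parser hands the comment to a separate handler and
reports only the live one.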


Primary concern of Petter: Still wonder how to handle logins, though...

This is really your determination to make. Taking anyone's opinion on
the matter would end in your system being less secure. I would guess
some method of encryption. I am not really clear on why a PassWrdMgr is
necessary in the development of a search engine crawler. This system
should really already be in place in your work environment. Maybe even
a firewall?
The reason I need some password management is that my app has to log in
to a secure site. Encryption would certainly be nice!
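
One way a crawler could log in is sketched below, assuming the site uses
an ordinary HTML form posted over HTTPS and keeps the session in a
cookie. The URL and the form field names `username` and `password` are
hypothetical; a real site may use different fields or a different
scheme altogether. Only Python 3 standard library calls are used.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def make_session():
    """Return an opener that remembers cookies, plus its cookie jar."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def build_login_request(login_url, username, password):
    """Build (but do not send) a form POST carrying the credentials.

    The field names are assumptions; check the site's login form.
    """
    form = urllib.parse.urlencode(
        {"username": username, "password": password}
    ).encode("utf-8")
    return urllib.request.Request(login_url, data=form, method="POST")


# Usage sketch (not executed here, it needs a real site):
#   opener, jar = make_session()
#   request = build_login_request("https://example.com/login", user, pw)
#   opener.open(request)                    # session cookie lands in jar
#   opener.open("https://example.com/private/page.html")
```

Using an `https://` URL means the credentials are encrypted in transit.
Reading the password at run time, for example with `getpass.getpass()`,
keeps it out of the source code.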

/Petter

___
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots


RE: [Robots] robot in python?

2003-11-19 Thread SsolSsinclair
BUSH CALLS ON SENATE TO RATIFY CYBERCRIME TREATY
President Bush has asked the US Senate to ratify the first
international cybercrime treaty.  Bush called the Council of
Europe's controversial treaty "an effective tool in the
global effort to combat computer-related crime" and "the
only multilateral treaty to address the problems of
computer-related crime and electronic evidence gathering."
http://news.com.com/2100-1028_3-5108854.html

This news looks set to affect the coding environment. I would appreciate
comments on its pertinence. I don't think Bush has time to spend
developing code, however.




RE: [Robots] robot in python?

2003-11-19 Thread SsolSsinclair
developing a crawler for a search engine with Python
==

You may find that the threads and exceptions in Python more than
make up for anything you are missing in Perl. The Python libraries
are not as extensive, but that is mostly because they have one of
everything instead of five or six of everything.

Extracting links using a regular HTML parser works fine, and isn't
that much work. One of the major issues in an HTML parser is
dealing with all the illegal HTML on the web.

This concluding statement [above] comes from Walter and outlines the
coding advantages of using Python. A person capable of inventing these
statements on the spot would know them to be true. I am unclear,
therefore, why portions of Verity Ultraseek [a commercial product] would
need to use C or C++ modules.

Has anybody here written a webbot in Python?

Answer
Verity Ultraseek is a web crawler and search engine written in
Python. Portions of it are C or C++ native modules. Ultraseek
is a commercial product, so we don't give out the code. Sorry.

from: Alexander Halavais
It really depends on what you are looking for, and how tolerant of
errors you are. For most of what I do, I use the HTML parser, but I have
also done simple expression matching to pull out links. This tends to
overestimate the links (e.g., pulling out references in comments, etc.),
and often yields fragments that are not really followable, but it is at
least a possibility.

I am unclear, and have an uninformed opinion, about the intent of
"finding" web pages that the original author did not wish to be found.
I cannot perceive any possibility of these pages being of interest to
the general public or to the corporate citizenship. However, this
statement by Walter: "This tends to overestimate the links (e.g.,
pulling out references in comments, etc.), and often yields fragments
that are not really followable, but it is at least a possibility."
seems to indicate there is a difference between the number of pages
retrieved and the "possible number of pages that could be retrieved".
This numerical difference is QUITE notable, and of interest to
competing coders. It is, however, a privately held number. Thanks Alex.

Specifically, I need something like the linkextor available in Perl.

Petter wrote:
Yes, in fact I found some very good examples on the website "Dive Into
Python", including how to do a linkextor. Quite simple.
http://diveintopython.org/html_processing/extracting_data.html This uses
SGMLParser, which presumably is more tolerant of illegal HTML.


Primary concern of Petter: Still wonder how to handle logins, though...

This is really your determination to make. Taking anyone's opinion on
the matter would end in your system being less secure. I would guess
some method of encryption. I am not really clear on why a PassWrdMgr is
necessary in the development of a search engine crawler. This system
should really already be in place in your work environment. Maybe even
a firewall?

.Ssol>.
Digital Acquizitionatory Inventory Drive Imaging

>Sol




Re: [Robots] robot in python?

2003-11-19 Thread Petter Karlström
Walter Underwood wrote:


Python is of course a smaller language, so the libraries aren't as
extensive as the Perl counterparts. Also, I find the documentation
somewhat lacking (or it could be me being new to the language).


You may find that the threads and exceptions in Python more than
make up for anything you are missing in Perl. The Python libraries
are not as extensive, but that is mostly because they have one of
everything instead of five or six of everything.
Yup, that's why I'm learning Python! I got tired of the "after the fact" 
object orientation and the sometimes maddening syntax of Perl.
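
A brief sketch of how threads and exceptions can work together in a
crawler, assuming Python 3's `concurrent.futures` module (which
postdates this thread). `crawl_all` and its `fetch` parameter are
hypothetical names chosen for illustration; `fetch` is any callable that
takes a URL and returns the page or raises.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def crawl_all(urls, fetch, max_workers=4):
    """Fetch every URL in a thread pool.

    Returns two dicts: pages keyed by URL, and the exception raised
    for each URL that failed. One bad page does not stop the rest.
    """
    pages, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each pending future back to the URL it is fetching.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                # result() re-raises whatever fetch raised in the worker.
                pages[url] = future.result()
            except Exception as exc:
                errors[url] = exc
    return pages, errors
```

Passing `fetch` in as a parameter keeps the sketch independent of any
particular HTTP library and makes it easy to try with a stub function.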

Extracting links using a regular HTML parser works fine, and isn't
that much work. One of the major issues in an HTML parser is
dealing with all the illegal HTML on the web.
Yes, in fact I found some very good examples on the website "Dive Into
Python", including how to do a linkextor. Quite simple.
http://diveintopython.org/html_processing/extracting_data.html This uses
SGMLParser, which presumably is more tolerant of illegal HTML.
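
A minimal link extractor in the spirit of the one mentioned above. The
`sgmllib` module that provided SGMLParser is not part of Python 3, so
this sketch swaps in the standard `html.parser.HTMLParser` instead. It
is not the "Dive Into Python" code, and the class and function names
are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before calling us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url=""):
    """Return absolute link URLs found in an HTML string."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

The parser does not require well-formed input: unclosed tags and mixed
quoting are passed through to the handlers without raising, which is
the kind of tolerance a crawler needs.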

Still wonder how to handle logins, though...

/petter
