RE: webspider, regexp not working, why?

Reedick, Andrew Fri, 23 May 2008 10:32:12 -0700


> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:python-
> [EMAIL PROTECTED] On Behalf Of
> [EMAIL PROTECTED]
> Sent: Friday, May 23, 2008 12:43 PM
> To: [email protected]
> Subject: webspider, regexp not working, why?
> 
> url = re.compile(r"^((ht|f)tp(s?)\:\/\/|~/|/)?([\w]+:\w+@)?([a-zA-Z]
> 
> search and match yields the same results.
> 
> but when you put something like href= in front of it, it doesn't work.



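a)  '^' anchors the match to the beginning of the string (or of each
line, with re.MULTILINE).  So once 'href=' comes before the URL, the URL
no longer sits at the anchor and the pattern cannot match.  Drop the '^'
if you are searching inside larger text.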

b)  Regexes are hard enough to read as is.  (http|https|ftp) is more
readable than (ht|f)tp(s?), though unlike the original it does not
accept ftps.

c)  If you're going to parse html/xml then bite the bullet and learn one
of the libraries specifically designed to parse html/xml.  Many other
regex gurus have learned this lesson.  Myself included.  =)
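
A rough sketch of points (a) and (c), assuming Python 3 and only the
standard library.  The two patterns are simplified stand-ins for the
truncated regex in the question, and LinkCollector / extract_links are
hypothetical names made up for this illustration, not part of any
library:

```python
import re
from html.parser import HTMLParser

# Simplified stand-ins for the truncated pattern in the question (assumption).
ANCHORED = re.compile(r"^(https?|ftp)://\S+")
UNANCHORED = re.compile(r"(https?|ftp)://[^\s\"'<>]+")

line = 'href="http://www.python.org/about/"'

# Point (a): with '^' the URL must start the string, so 'href=' breaks it.
print(ANCHORED.search(line))              # None
print(UNANCHORED.search(line).group(0))   # the URL without the quotes


# Point (c): let an HTML parser find the attributes instead of a regex.
class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html):
    """Return all <a href="..."> values found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

The parser route also copes with things a URL regex tends to trip over,
such as single-quoted attributes, upper-case tags and relative links.
A third-party library may be more forgiving of badly broken markup, but
that is a separate choice from the one sketched here.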




