On 25/11/15 23:48, ryguy7272 wrote:
>> re.findall( r'\<a[^>]+title="(.+?)"', html )
>> [ ... ]
> Thanks!!  Is that regex?  Can you explain exactly what it is doing?
> Also, it seems to pick up a lot more than just the list I wanted, but that's
> ok, I can see why it does that.
>
> Can you just please explain what it's doing???


Yes, it's a regular expression. Because regular expressions use the backslash as an escape character, and Python string literals do too, it is advisable to use the "raw string" prefix (an r before the opening single, double or triple quote). To illustrate with an example:
        >>> print "1\n2"
        1
        2
        >>> print r"1\n2"
        1\n2
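
The same effect shows up inside regular expressions. As a small sketch (the sample sentence is made up for illustration): \b means "word boundary" to the re module, but in an ordinary string literal Python turns it into a backspace character before re ever sees it.

```python
import re

text = "a cat sat on the catalog"   # made-up sample text

# Without the raw prefix, "\b" is a backspace character (0x08),
# so the pattern looks for backspace + "cat" + backspace.
plain = re.findall("\bcat\b", text)

# With the raw prefix, the backslash reaches the re module intact
# and \b means "word boundary".
raw = re.findall(r"\bcat\b", text)

print(plain)   # []
print(raw)     # ['cat']
```
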
Since the raw string "neutralizes" Python's own backslash escapes, you can write the usual RegEx syntax as-is:

\<a[^>]+title="(.+?)"

\<      was a mistake on my part; a plain < is perfectly enough
[^>]    is a character class, and the caret (^) indicates negation. Thus it means: any character other than >
+       indicates repetition: one or more of the previous element
.       matches any character (except a newline, by default)
.+"     is a _greedy_ pattern: it matches as much as it can, up to the last double quote available
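
To see the pieces working together, here is a minimal sketch that runs the pattern (without the unneeded backslash) over a tiny HTML snippet. The snippet is invented for this illustration; it is not the page from the original question. It also shows one way extra matches can creep in: the pattern only requires the tag name to start with "a", so an <abbr> tag with a title is picked up as well.

```python
import re

# Hypothetical HTML snippet, invented for this illustration.
html = (
    '<a href="/wiki/Foo" title="Foo">Foo</a> '
    '<abbr title="HyperText">HT</abbr> '
    '<a href="/plain">no title</a>'
)

# <a       literal start of a tag whose name begins with "a"
# [^>]+    one or more characters that are not ">", i.e. still inside the tag
# title="  literal attribute name, equals sign and opening quote
# (.+?)    the title text, captured, as short as possible
# "        closing quote
titles = re.findall(r'<a[^>]+title="(.+?)"', html)

print(titles)   # ['Foo', 'HyperText']
```

The last link is skipped because [^>] cannot cross the > that closes the tag, so no title=" can be found inside it.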

The problem with a greedy pattern is that it doesn't stop at the first place where the match could end; it keeps going as far as it can. To illustrate:
>>> a = re.search( r'".+"', 'title="this is a test" class="test"' )
>>> a.group()
'"this is a test" class="test"'

It matches from the first quote up to the last one.
On the other hand, you can use the "?" modifier to specify a non-greedy pattern:

>>> b = re.search( r'".+?"', 'title="this is a test" class="test"' )
>>> b.group()
'"this is a test"'

It matches from the first quote and stops as soon as it reaches the next quote.
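
As a side-by-side sketch on the same sample string, re.findall shows every match each variant produces. The third variant is an alternative not used above: it reuses the negated-class idea from [^>] to say "anything but a quote", which gives the same result here without relying on non-greedy matching.

```python
import re

s = 'title="this is a test" class="test"'

greedy = re.findall(r'".+"', s)      # as long as possible
lazy = re.findall(r'".+?"', s)       # as short as possible
negated = re.findall(r'"[^"]+"', s)  # quote, non-quote characters, quote

print(greedy)    # ['"this is a test" class="test"']
print(lazy)      # ['"this is a test"', '"test"']
print(negated)   # ['"this is a test"', '"test"']
```
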

Finally, parentheses indicate a capture group:
>>> a = re.search( r'"this (is) a (.+?)"', 'title="this is a test" class="test"' )
>>> a.groups()
('is', 'test')
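
Capture groups are also why re.findall returned only the title text rather than the whole matched tag: when a pattern contains groups, findall returns the groups instead of the full match. A small sketch on the same sample string (the pattern with two groups is just an illustration, not part of the original one-liner):

```python
import re

s = 'title="this is a test" class="test"'

# Two groups: the attribute name and the attribute value.
m = re.search(r'(\w+)="(.+?)"', s)

print(m.group(0))   # title="this is a test"   (the whole match)
print(m.group(1))   # title
print(m.group(2))   # this is a test

# With several groups, findall returns one tuple per match.
pairs = re.findall(r'(\w+)="(.+?)"', s)
print(pairs)        # [('title', 'this is a test'), ('class', 'test')]
```
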


You can find detailed explanations of Python regular expressions on this page: https://docs.python.org/2/howto/regex.html

HTH,

-Grobu-

--
https://mail.python.org/mailman/listinfo/python-list
