Re: Screen scraper to get all 'a title' elements

Grobu Wed, 25 Nov 2015 15:37:37 -0800


On 25/11/15 23:48, ryguy7272 wrote:

re.findall( r'\<a[^>]+title="(.+?)"', html )

[ ... ]

Thanks!!  Is that regex?  Can you explain exactly what it is doing?
Also, it seems to pick up a lot more than just the list I wanted, but that's 
ok, I can see why it does that.


Can you just please explain what it's doing???

Yes, it's a regular expression. Because both Python string literals and RegEx's use the backslash as an escape character, it is advisable to use the "raw string" prefix (an r before the opening single/double/triple quote). To illustrate it with an example :

        >>> print "1\n2"
        1
        2
        >>> print r"1\n2"
        1\n2

As the backslash escape character is "neutralized" by the raw string, you can use the usual RegEx syntax at leisure :


\<a[^>]+title="(.+?)"

\<   was a mistake on my part, a single < is perfectly enough
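A quick way to check this, as a minimal sketch (the sample markup in html is made up for illustration, it is not from the page you are scraping):

```python
import re

# Hypothetical one-line sample, only here to compare the two patterns.
html = '<a href="/x" title="Example">link</a>'

# Original pattern, with the unnecessary backslash before '<'.
escaped = re.findall(r'\<a[^>]+title="(.+?)"', html)

# Same pattern with a plain '<'.
plain = re.findall(r'<a[^>]+title="(.+?)"', html)

print(escaped)
print(plain)
```

Both calls should return the same list, since < has no special meaning in a Python regular expression.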

[^>] is a class definition, and the caret (^) character indicates negation. Thus it means : any character other than >

+       indicates repetition : one or more of the previous element
.       will match any single character (except a newline, by default)
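To see these pieces at work on their own, here is a small sketch (the tag string is a made-up sample, not taken from the original page):

```python
import re

# Hypothetical sample tag, for illustration only.
tag = '<a href="/x" title="Example">link</a>'

# [^>]+ consumes one or more characters that are not '>',
# so the match stops just before the '>' that closes the opening tag.
inside = re.search(r'<a[^>]+', tag).group()
print(inside)

# '.' matches any single character, but by default not a newline.
dots = re.findall(r'.', 'a>\nb')
print(dots)
```

The first search should stop right before the closing > of the opening tag, and the newline should be missing from the second result.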

.+" is a _greedy_ pattern that would match anything until it encountereda double quote

The problem with a greedy pattern is that it doesn't stop at the firstmatch. To illustrate :

>>> a = re.search( r'".+"', 'title="this is a test" class="test"' )
>>> a.group()
'"this is a test" class="test"'

It matches the first quote up to the last one.

On the other hand, you can use the "?" modifier to specify a non-greedypattern :


>>> b = re.search( r'".+?"', 'title="this is a test" class="test"' )
>>> b.group()
'"this is a test"'

It matches the first quote and stops looking for further matches after the second quote.
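The difference is perhaps easier to see with re.findall, which returns every non-overlapping match. A minimal sketch on the same test string:

```python
import re

s = 'title="this is a test" class="test"'

# Greedy: one single match running from the first quote to the last one.
greedy = re.findall(r'".+"', s)
print(greedy)

# Non-greedy: each quoted value is matched separately.
lazy = re.findall(r'".+?"', s)
print(lazy)
```

The greedy version should give a list with one long string, the non-greedy one a list with the two quoted values.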


Finally, the parentheses are used to indicate a capture group :

>>> a = re.search( r'"this (is) a (.+?)"', 'title="this is a test" class="test"' )

>>> a.groups()
('is', 'test')
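Putting it all together, here is a sketch of the whole pattern applied with re.findall. The HTML below is invented for the example; your real page will of course differ. When the pattern contains a single capture group, re.findall returns only the captured text, which is why you get the bare titles:

```python
import re

# Hypothetical HTML sample, for illustration only.
html = '''
<a href="/one" title="First link">1</a>
<a href="/two" title="Second link" class="nav">2</a>
<img src="x.png" title="Not a link">
'''

# '<a'      : literal start of an anchor tag
# '[^>]+'   : anything up to (but not past) the end of the opening tag
# 'title="' : the attribute we are after
# '(.+?)'   : capture group, non-greedy, stops at the next double quote
titles = re.findall(r'<a[^>]+title="(.+?)"', html)
print(titles)
```

Only the two anchor titles should come out; the img title is skipped because the pattern requires the tag to begin with <a.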

You can find detailed explanations about Python regular expressions at this page : https://docs.python.org/2/howto/regex.html


HTH,

-Grobu-

--
https://mail.python.org/mailman/listinfo/python-list
