On Feb 11, 6:05 pm, Ayaz Ahmed Khan <[EMAIL PROTECTED]> wrote:
> "mtuller" typed:
>
> > I have also tried Beautiful Soup, but had trouble understanding the
> > documentation
>
> As Gabriel has suggested, spend a little more time going through the
> documentation of BeautifulSoup. It is pretty easy to grasp.
>
> I'll give you an example: I want to extract the text between the
> following span tags in a large HTML source file.
>
> <span class="title">Linux Kernel Bluetooth CAPI Packet Remote Buffer Overflow
> Vulnerability</span>
>
> >>> import re
> >>> from BeautifulSoup import BeautifulSoup
> >>> from urllib2 import urlopen
> >>> soup = BeautifulSoup(urlopen('http://www.someurl.tld/'))
> >>> title = soup.find(name='span', attrs={'class':'title'},
> >>>                   text=re.compile(r'^Linux \w+'))
> >>> title
> u'Linux Kernel Bluetooth CAPI Packet Remote Buffer Overflow Vulnerability'
>
One can even use ElementTree, if the HTML is well-formed. See below.
However, if it is as ill-formed as the sample (4th "td" element not
closed; I've omitted it below), then the OP would be better off
sticking with Beautiful Soup :-)

C:\junk>type element_soup.py
from xml.etree import cElementTree as ET
import cStringIO

guff = """
<tr >
<td headers="col1_1" style="width:21%" >
<span class="hpPageText" >LETTER</span></td>
<td headers="col2_1" style="width:13%; text-align:right" >
<span class="hpPageText" >33,699</span></td>
<td headers="col3_1" style="width:13%; text-align:right" >
<span class="hpPageText" >1.0</span></td>
</tr>
"""

tree = ET.parse(cStringIO.StringIO(guff))
for elem in tree.getiterator('td'):
    key = elem.get('headers')
    assert elem[0].tag == 'span'
    value = elem[0].text
    print repr(key), repr(value)

C:\junk>\python25\python element_soup.py
'col1_1' 'LETTER'
'col2_1' '33,699'
'col3_1' '1.0'

HTH,
John
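
A note for anyone trying the ElementTree script on Python 3: cStringIO
and the print statement do not exist there, and cElementTree and
getiterator() are gone from recent releases. What follows is only a
rough sketch of how the same idea might look, not a definitive port.
The names extract_cells and is_well_formed are made up for
illustration, and the one-line ill-formed snippet in the usage part is
a stand-in for the unclosed "td" mentioned above, not the OP's actual
markup.

```python
import xml.etree.ElementTree as ET

# Same well-formed sample rows as in element_soup.py.
GUFF = """
<tr >
<td headers="col1_1" style="width:21%" >
<span class="hpPageText" >LETTER</span></td>
<td headers="col2_1" style="width:13%; text-align:right" >
<span class="hpPageText" >33,699</span></td>
<td headers="col3_1" style="width:13%; text-align:right" >
<span class="hpPageText" >1.0</span></td>
</tr>
"""


def extract_cells(markup):
    """Return (headers attribute, span text) pairs, one per td element.

    Hypothetical helper; assumes the markup is well-formed XML.
    """
    root = ET.fromstring(markup)
    pairs = []
    for td in root.iter('td'):          # iter() replaces getiterator()
        span = td.find('span')          # first span child, if any
        if span is not None:
            pairs.append((td.get('headers'), span.text))
    return pairs


def is_well_formed(markup):
    """Return True if ElementTree can parse the markup, else False.

    Hypothetical helper, useful for deciding whether to fall back
    to a more forgiving parser such as Beautiful Soup.
    """
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    for key, value in extract_cells(GUFF):
        print(repr(key), repr(value))
    # An unclosed td, standing in for the ill-formed case.
    print(is_well_formed('<tr><td><span>x</span></tr>'))
```

The is_well_formed check is one way to choose between the two
approaches in this thread: use ElementTree when the parse succeeds,
and hand the page to Beautiful Soup when it raises ParseError.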