Re: how can i use lxml with win32com?

Michiel Overtoom Sun, 25 Oct 2009 01:15:42 -0700

elca wrote:

yes i want to extract this text 'CNN Shop' and linked page
'http://www.turnerstoreonline.com'.


Well then.
First, we'll get the page using urrlib2:

    doc=urllib2.urlopen("http://www.cnn.com";)

Then we'll feed it into the HTML parser:

    soup=BeautifulSoup(doc)

Next, we'll look at all the links in the page:

    for a in soup.findAll("a"):

and when a link has the text 'CNN Shop', we have a hit,
and print the URL:

        if a.renderContents()=="CNN Shop":
            print a["href"]


The complete program is thus:

import urllib2
from BeautifulSoup import BeautifulSoup

doc=urllib2.urlopen("http://www.cnn.com";)
soup=BeautifulSoup(doc)
for a in soup.findAll("a"):
    if a.renderContents()=="CNN Shop":
        print a["href"]

The example above can be condensed because BeautifulSoup's find functioncan also look for texts:


    print soup.find("a",text="CNN Shop")

and since that's a navigable string, we can ascend to its parent anddisplay the href attribute:


    print soup.find("a",text="CNN Shop").findParent()["href"]

So eventually the whole program could be collapsed into one line:

printBeautifulSoup(urllib2.urlopen("http://www.cnn.com";)).find("a",text="CNNShop").findParent()["href"]


...but I think this is very ugly!


> im very sorry my english.

You English is quite understandable. The hard part is figuring out whatexactly you wanted to achieve ;-)

I have a question too. Why did you think JavaScript was necessary toarrive at this result?


Greetings,
--
http://mail.python.org/mailman/listinfo/python-list

Re: how can i use lxml with win32com?

Reply via email to