Parsing html with Beautifulsoup

Johann Spies Thu, 10 Dec 2009 01:18:00 -0800

I am trying to get CSV output from an HTML file.

With this code I had a little success:
=========================
from BeautifulSoup import BeautifulSoup

f = open("configuration.html", "r")
g = open("configuration.csv", "w")
soup = BeautifulSoup(f)
for table in soup.findAll('table'):
    rows = table.findAll('tr')
    # Header: write the first text node of each cell in the first row.
    for th in rows[0]:
        g.write(th.find(text=True))
        g.write(',')

    # Body: one CSV line per row, one field per <td>.
    for tr in rows:
        for td in tr.findAll('td'):
            try:
                # find(text=True) returns only the first text node of the
                # cell, or None (hence the AttributeError) if there is none.
                g.write(td.find(text=True).replace('&nbsp;', ''))
            except AttributeError:
                pass
            g.write(",")
        g.write("\n")
f.close()
g.close()
===============================

producing output like this:

RULE,SOURCE,DESTINATION,SERVICES,ACTION,TRACK,TIME,INSTALL ON,COMMENTS,
1,,,,drop,Log,Any,,,
2,All us...@any,,Any,clientencrypt,Log,Any,,,
3,Any,Any,,drop,None,Any,,,
4,,,,drop,None,Any,,,
...

It left out all the parts of each <td></td> that are not plain text,
such as the link text inside <a> tags.

I then tried using t.renderContents() and got something like this (one
line broken into many for the sake of this email):

1,<img src=icons/group.png>&nbsp;<a href=#OBJ_sunetint>
sunetint</A><BR>, 
<img src=icons/gateway_cluster.png>&nbsp;<a href=#OBJ_Rainwall_Cluster
>Rainwall_Cluster</A> <BR>,
<img src=icons/udp.png>&nbsp;<a href=#SVC_IKE >IKE</a><br>,
<img src=icons/drop.png>&nbsp;drop,
<img src=icons/log.png>&nbsp;Log&nbsp;,
<img src=icons/any.png>&nbsp;Any<br>&nbsp;,
<img src=icons/gateway_cluster.png>&nbsp;<a href=#OBJ_Rainwall_Cluster
>Rainwall_Cluster</A> <BR>&nbsp;,&nbsp;

How do I get BeautifulSoup to render (taking the above line as an
example)

sunetint for <img src=icons/group.png>&nbsp;<a
href=#OBJ_sunetint>sunetint</A><BR>

and still return the plain-text parts of the other <td>'s as plain text?

I have experimented a little with regular expressions, but have not
found a solution so far.
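
To make the desired result concrete: the idea is to collect every text
fragment inside a cell, not only the first one, and join them. The
sketch below does this with the standard library's html.parser under
Python 3 rather than with BeautifulSoup, purely so that it runs without
extra packages. The names TableText, table_to_csv and sample are made up
for illustration, and the sample markup is abridged from the line quoted
above. In BeautifulSoup 3 the analogous step would presumably be
''.join(td.findAll(text=True)) in place of td.find(text=True).

```python
import csv
import io
from html.parser import HTMLParser


class TableText(HTMLParser):
    """Collect the text of every <td>/<th> cell, row by row."""

    def __init__(self):
        # convert_charrefs=True turns &nbsp; into the character U+00A0.
        super().__init__(convert_charrefs=True)
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell being read

    def _close_cell(self):
        if self._cell is not None:
            text = "".join(self._cell).replace("\xa0", " ")
            # Collapse runs of whitespace and trim the ends.
            self._row.append(" ".join(text.split()))
            self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()      # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th"):
            self._close_cell()     # tolerate a missing </td>
            if self._row is None:
                self._row = []
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()

    def handle_data(self, data):
        # Text nested inside <a>, <b>, ... still arrives here.
        if self._cell is not None:
            self._cell.append(data)


def table_to_csv(html):
    """Return the table cells in html as CSV text."""
    parser = TableText()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


sample = (
    "<table><tr><th>RULE</th><th>SOURCE</th><th>ACTION</th></tr>"
    "<tr><td>1</td>"
    "<td><img src=icons/group.png>&nbsp;"
    "<a href=#OBJ_sunetint>sunetint</A><BR></td>"
    "<td><img src=icons/drop.png>&nbsp;drop</td></tr></table>"
)
print(table_to_csv(sample))
```

For the sample this prints the header row followed by 1,sunetint,drop.
Writing through csv.writer also quotes any cell that itself contains a
comma, which joining the fields by hand with "," does not.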

Regards
Johann
-- 
Johann Spies          Telefoon: 021-808 4599
Informasietegnologie, Universiteit van Stellenbosch

     "Lo, children are an heritage of the LORD: and the  
      fruit of the womb is his reward."        Psalms 127:3 