I like to use sgmllib for parsing HTML when I can. It's nice and clean.
Basically, you subclass SGMLParser and define methods such as
start_<tagname> and end_<tagname>, as well as handle_data. There are some
other nifty hooks, such as default handling for tags you have not written
a method for. I highly recommend checking it out.

An example that may or may not work is as follows:
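
import sgmllib

class foo(sgmllib.SGMLParser):
    def __init__(self, verbose=0):
        sgmllib.SGMLParser.__init__(self, verbose)
        self.current_tag = ""

    # called for each opening <description> tag
    def start_description(self, attrs):
        self.current_tag = "description"

    # called for each closing </description> tag
    def end_description(self):
        self.current_tag = ""

    # called for the text between tags
    def handle_data(self, data):
        if self.current_tag == "description":
            print data

bar = foo()
# feed() takes a string; somefile is assumed here to be a path to the data file
bar.feed(open(somefile).read())
bar.close()
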
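
sgmllib was removed in Python 3, so on a newer interpreter the same idea could look roughly like this with html.parser from the standard library. This is only a sketch: the names DescriptionParser and extract_description are made up for illustration, and I'm assuming the data file uses a literal <description> tag.

```python
from html.parser import HTMLParser


class DescriptionParser(HTMLParser):
    """Collect the text found between <description> and </description>."""

    def __init__(self):
        super().__init__()
        self.in_description = False  # True while inside the tag of interest
        self.chunks = []             # pieces of text seen inside that tag

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag names in lower case
        if tag == "description":
            self.in_description = True

    def handle_endtag(self, tag):
        if tag == "description":
            self.in_description = False

    def handle_data(self, data):
        # Text may arrive in more than one piece, so accumulate it
        if self.in_description:
            self.chunks.append(data)


def extract_description(text):
    """Return the text inside every <description> block in `text`."""
    parser = DescriptionParser()
    parser.feed(text)
    parser.close()
    return "".join(parser.chunks)
```

Because the handlers key off the tag name rather than the position in the file, it should not matter what order the blocks appear in.
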
On 12/12/06, James G. Sack (jim) <[EMAIL PROTECTED]> wrote:
Andrew Lentvorski wrote:
> Todd Walton wrote:
>
>> So, the script runs through the text file line by line, until it finds
>> the opening description tag and then, starting with the next line,
>> writes it all out to a new file until it comes to the end-description
>> tag. Same for the other two. Will this work? If the blocks are out
>> of order in the datafile will this still work?
>
> Possibly, but you're making an awful lot of work for yourself and it
> will be brittle if you need to add or subtract sections over time.
>
>> Should I change something?
>
> Yes, this is the kind of thing that XML was actually made for.
>
> Since you are already using "HTML-style" tags, I heartily recommend that
> you add just enough extra structure so that you can let any of the
> myriad XML DOM bindings just suck the whole file in and then work on it.
>
> The magic keywords in Python are probably pulldom and/or elementtree.
> I'm sure Perl has something similar.
>
Possible additional useful stuff:

  fix broken html/xml: elementtidy (*very nice*, based on the HTML Tidy program)
    http://effbot.org/zone/element-tidylib.htm

  from xml.sax.saxutils import XMLFilterBase, XMLGenerator
    http://www-128.ibm.com/developerworks/xml/library/x-tipsaxflex.html

  Uche has some good examples in his writings (esp. those on xml.com)
    http://uche.ogbuji.net/tech/akara/pyxml/
    (ps: the 4Suite package from his company transparently
    extends the standard Python XML stuff)

Possibly also related:
    http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/84515
    http://effbot.org/zone/element-index.htm

Of course it all may be overkill if you're just going to do this once or
twice.
Regards,
..jim
--
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-list