I like to use sgmllib for parsing HTML when I can. It's nice and clean.
Basically, you define methods such as start_<tagname> and end_<tagname>, as
well as handle_data. There are some other nifty hooks, such as the default
handling of tags you haven't written a method for (unknown_starttag and
unknown_endtag). I highly recommend checking it out.

An example that may or may not work is as follows:

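import sgmllib  # Python 2 only; sgmllib was removed in Python 3

class foo(sgmllib.SGMLParser):
    def __init__(self, verbose=0):
        sgmllib.SGMLParser.__init__(self, verbose)
        self.current_tag = ""

    def start_description(self, attrs):
        # called when <description> is seen
        self.current_tag = "description"

    def end_description(self):
        # called when </description> is seen
        self.current_tag = ""

    def handle_data(self, data):
        # only print text that sits inside a description block
        if self.current_tag == "description":
            print(data)

bar = foo()
# feed() takes a string; this assumes somefile holds the path to the data file
bar.feed(open(somefile).read())
bar.close()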
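
One caveat: sgmllib only exists in Python 2 and was removed in Python 3. A rough equivalent of the same idea using the standard library's html.parser might look like the sketch below. The names DescriptionParser and extract_descriptions and the sample string are made up for illustration, and I have only tried it on toy input like this.

```python
from html.parser import HTMLParser


class DescriptionParser(HTMLParser):
    """Collect the text found between <description> and </description>."""

    def __init__(self):
        super().__init__()
        self.in_description = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # html.parser has one hook for all tags instead of start_<tagname>
        if tag == "description":
            self.in_description = True

    def handle_endtag(self, tag):
        if tag == "description":
            self.in_description = False

    def handle_data(self, data):
        # keep only the text that sits inside a description block
        if self.in_description:
            self.chunks.append(data)


def extract_descriptions(text):
    """Hypothetical helper: return all description text as one string."""
    parser = DescriptionParser()
    parser.feed(text)
    parser.close()
    return "".join(parser.chunks)


# Made-up sample input, just to show the call
sample = "<title>Widget</title>\n<description>\nA small widget.\n</description>\n"
print(extract_descriptions(sample).strip())
```

The main difference from sgmllib is that you check the tag name yourself inside handle_starttag and handle_endtag rather than defining one method per tag.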


On 12/12/06, James G. Sack (jim) <[EMAIL PROTECTED]> wrote:

Andrew Lentvorski wrote:
> Todd Walton wrote:
>
>> So, the script runs through the text file line by line, until it finds
>> the opening description tag and then, starting with the next line,
>> writes it all out to a new file until it comes to the end-description
>> tag.  Same for the other two.  Will this work?  If the blocks are out
>> of order in the datafile will this still work?
>
> Possibly, but you're making an awful lot of work for yourself and it
> will be brittle if you need to add or subtract sections over time.
>
>> Should I change something?
>
> Yes, this is the kind of thing that XML was actually made for.
>
> Since you are already using "HTML-style" tags, I heartily recommend that
> you add just enough extra structure so that you can let any of the
> myriad XML DOM bindings just suck the whole file in and then work on it.
>
> The magic keywords in Python are probably pulldom and/or elementtree.
> I'm sure Perl has something similar.
>
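
For what it's worth, the "add just enough structure and suck the whole file in" approach could look something like this with ElementTree (xml.etree.ElementTree in the standard library). This is only a sketch: the record root element and the title and notes tags are assumptions standing in for whatever the real data file uses, and the helper name sections is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical file contents: the existing tags wrapped in a single root
# element so that the whole thing is well-formed XML.
DATA = """<record>
  <notes>Ships in January.</notes>
  <description>
A small widget.
Fits in a pocket.
  </description>
  <title>Widget</title>
</record>"""


def sections(xml_text):
    """Return a {tag: text} dict for each child of the root element."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


parsed = sections(DATA)
# Lookups are by tag name, so the order of the blocks in the file is irrelevant.
print(parsed["description"])
```

Because each block is looked up by name, blocks that appear out of order in the data file are not a problem, and adding or dropping a section later does not require touching the parsing code.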

Possible additional useful stuff:
fix broken html/xml: elementtidy (*very nice*, based on html tidy prog)
   http://effbot.org/zone/element-tidylib.htm
from xml.sax.saxutils import XMLFilterBase, XMLGenerator
   http://www-128.ibm.com/developerworks/xml/library/x-tipsaxflex.html

Uche has some good examples in his writings (esp: those on xml.com)
  http://uche.ogbuji.net/tech/akara/pyxml/
  (ps: the 4Suite package from his company transparently
       extends standard python xml stuff)

Possibly also related:
  http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/84515
  http://effbot.org/zone/element-index.htm


Of course it all may be overkill if you're just going to do this once or
twice.

Regards,
..jim


--
[email protected]
http://www.kernel-panic.org/cgi-bin/mailman/listinfo/kplug-list

