from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):

    def __init__(self):
        HTMLParser.__init__(self)
        self.TokenList = []

    def handle_data(self, data):
        data = data.strip()
        if data and len(data) > 0:
            self.TokenList.append(data)
            #print data

    def GetTokenList(self):
        return self.TokenList

import urllib

try:
    url = "http://....your url here.............."
    f = urllib.urlopen(url)
    res = f.read()
    f.close()
except IOError:
    print "bad read"
else:
    # Only parse when the read succeeded
    h = MyHTMLParser()
    h.feed(res)
    tokensList = h.GetTokenList()


Kenneth McDonald wrote:
> I'm writing a program that will parse HTML and (mostly) convert it to
> MediaWiki format. The two Python modules I'm aware of to do this are
> HTMLParser and htmllib. However, I'm currently experiencing either real
> or conceptual difficulty with both, and was wondering if I could get
> some advice.
>
> The problem I'm having with HTMLParser is simple; I don't seem to be
> getting the actual text in the HTML document. I've implemented the
> do_data method of HTMLParser.HTMLParser in my HTMLParser subclass, but
> it never seems to receive any data. Is there another way to access the
> text chunks as they come along?
>
> HTMLParser would probably be the way to go if I can figure this out. It
> seems much simpler than htmllib, and satisfies my requirements.
>
> htmllib will write out the text data (using the AbstractFormatter and
> AbstractWriter), but my problem here is conceptual. I simply don't
> understand why all of these different "levels" of abstractness are
> necessary, nor how to use them. As an example, the html <i>text</i>
> should be converted to ''text'' (double single-quotes at each end) in my
> mediawiki markup output. This would obviously be easy to achieve if I
> simply had an html parse that called a method for each start tag, text
> chunk, and end tag. But htmllib calls the tag functions in HTMLParser,
> and then does more things with both a formatter and a writer. To me,
> both seem unnecessarily complex (though I suppose I can see the benefits
> of a writer before generators gave the opportunity to simply yield
> chunks of output to be processed by external code.) In any case, I don't
> really have a good idea of what I should do with htmllib to get my
> converted tags, and then content, and then closing converted tags,
> written out.
>
> Please feel free to point to examples, code, etc. Probably the simplest
> solution would be a way to process text content in HTMLParser.HTMLParser.
>
> Thanks,
> Ken
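
On the do_data point: HTMLParser has no do_data hook, so a method with that name is never called. Text chunks are delivered to handle_data, and tags to handle_starttag and handle_endtag, which is the "one method per start tag, text chunk, and end tag" interface asked for. For the <i>text</i> to ''text'' case, a rough sketch might look like the following. It is only an illustration under some assumptions: it uses the Python 3 module name (html.parser), and the class name WikiConverter, the helper html_to_wiki, and the small tag table are made up for the example, not part of any library.

```python
from html.parser import HTMLParser  # Python 3 location of the HTMLParser class


class WikiConverter(HTMLParser):
    """Hypothetical converter: maps a few HTML tags to MediaWiki markup."""

    # Assumed, deliberately tiny mapping; every other tag is simply dropped.
    MARKUP = {"i": "''", "em": "''", "b": "'''", "strong": "'''"}

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; unknown tags contribute nothing.
        self.chunks.append(self.MARKUP.get(tag, ""))

    def handle_endtag(self, tag):
        # MediaWiki uses the same marker to open and close italics or bold.
        self.chunks.append(self.MARKUP.get(tag, ""))

    def handle_data(self, data):
        # Text chunks arrive here; the hook must be named handle_data.
        self.chunks.append(data)

    def result(self):
        return "".join(self.chunks)


def html_to_wiki(html):
    """Feed one HTML string through the converter and return the markup."""
    parser = WikiConverter()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.result()


print(html_to_wiki("<p>Some <i>text</i> here</p>"))  # prints: Some ''text'' here
```

Collecting the pieces in a list and joining them at the end keeps the handlers trivial; links, lists, and headings would need more state than this sketch carries.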