On Oct 24, 4:36 pm, josh logan <dear.jay.lo...@gmail.com> wrote:
> Hello,
>
> I wanted to use Python to scrub an HTML file for score data, but I'm
> having trouble.
> I'm using HTMLParser, and the parsing seems to fizzle out around line
> 192 or so. None of the event functions are being called anymore
> (handle_starttag, handle_endtag, etc.) and I don't understand why,
> because it is an HTML page over 1000 lines long.
>
> Could someone tell me if this is a bug or simply a misunderstanding of
> how HTMLParser works? I'd really appreciate some help in
> understanding.
>
> I am using Python 3.1.2 on Windows 7 (hopefully shouldn't matter).
>
> I put the HTML file on pastebin, because I couldn't think of anywhere
> better to put it: http://pastebin.com/wu6Pky2W
>
> The source code has been pared down to the simplest form to exhibit
> the problem. It is displayed below, and is also on pastebin for
> download (http://pastebin.com/HxwRTqrr):
>
> import sys
> import re
> import os.path
> import itertools as it
> import urllib.request
> from html.parser import HTMLParser
> import operator as op
>
> base_url = 'http://www.dci.org'
>
> class TestParser(HTMLParser):
>
>     def handle_starttag(self, tag, attrs):
>         print('position {}, starting tag {} with attrs {}'.format(
>             self.getpos(), tag, attrs))
>
>     def handle_endtag(self, tag):
>         print('ending tag {}'.format(tag))
>
> def do_parsing_from_file_stream(fname):
>     parser = TestParser()
>
>     with open(fname) as f:
>         for num, line in enumerate(f, start=1):
>             # print('Sending line {} through parser'.format(num))
>             parser.feed(line)
>
> if __name__ == '__main__':
>     do_parsing_from_file_stream(sys.argv[1])

Sorry, the group doesn't like how I surrounded the Python code's pastebin URL with parentheses. Here it is again: http://pastebin.com/HxwRTqrr
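
For anyone trying to narrow this down, here is a rough sketch of a comparison that might help. It assumes (this is only a guess, not a confirmed diagnosis) that the parser is holding unprocessed input in its buffer rather than silently dropping it. The names CountingParser, parse_text and parse_lines are hypothetical and not part of the script above. The sketch feeds the same document once in a single call and once line by line, and calls close() at the end of each run, which the HTMLParser documentation describes as forcing the processing of all buffered data.

```python
from html.parser import HTMLParser


class CountingParser(HTMLParser):
    """Hypothetical helper: records tag events instead of printing them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Keep the position so the last event shows where parsing stopped.
        self.events.append(('start', tag, self.getpos()))

    def handle_endtag(self, tag):
        self.events.append(('end', tag, self.getpos()))


def parse_text(text):
    """Feed the whole document in one call, then flush the buffer."""
    parser = CountingParser()
    parser.feed(text)
    parser.close()  # force processing of anything still buffered
    return parser.events


def parse_lines(lines):
    """Feed the document one line at a time, then flush the buffer."""
    parser = CountingParser()
    for line in lines:
        parser.feed(line)
    parser.close()
    return parser.events
```

If both functions record the same tags on a small well-formed sample but stop at the same place on the real page, the last recorded position points at the markup worth inspecting. If only the line-by-line run stops early, the chunking is the more likely suspect.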