Hello, I wanted to use Python to scrape an HTML file for score data, but I'm having trouble. I'm using HTMLParser, and the parsing seems to fizzle out around line 192 or so. From that point on, none of the event functions (handle_starttag, handle_endtag, etc.) are called, and I don't understand why, because the HTML page is over 1000 lines long.
Could someone tell me whether this is a bug or simply a misunderstanding of how HTMLParser works? I'd really appreciate some help in understanding. I am using Python 3.1.2 on Windows 7 (hopefully that shouldn't matter).

I put the HTML file on pastebin, because I couldn't think of anywhere better to put it: http://pastebin.com/wu6Pky2W

The source code has been pared down to the simplest form that exhibits the problem. It is displayed below, and is also on pastebin for download (http://pastebin.com/HxwRTqrr):

    import sys
    import re
    import os.path
    import itertools as it
    import urllib.request
    from html.parser import HTMLParser
    import operator as op

    base_url = 'http://www.dci.org'

    class TestParser(HTMLParser):
        # Print every start and end tag the parser reports.
        def handle_starttag(self, tag, attrs):
            print('position {}, starting tag {} with attrs {}'.format(self.getpos(), tag, attrs))

        def handle_endtag(self, tag):
            print('ending tag {}'.format(tag))

    def do_parsing_from_file_stream(fname):
        # Feed the file to the parser one line at a time.
        parser = TestParser()
        with open(fname) as f:
            for num, line in enumerate(f, start=1):
                # print('Sending line {} through parser'.format(num))
                parser.feed(line)

    if __name__ == '__main__':
        do_parsing_from_file_stream(sys.argv[1])
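One rough way to narrow this down is to watch how much input the parser is holding back after each feed() call. The sketch below is only a diagnostic aid, not an explanation of the problem. It assumes that HTMLParser keeps unconsumed input in its rawdata attribute, which is an implementation detail of the standard library class rather than a documented interface. CountingParser, trace_buffer and the sample lines are hypothetical names and data made up for illustration.

```python
from html.parser import HTMLParser


class CountingParser(HTMLParser):
    """Hypothetical helper that counts start-tag and end-tag events."""

    def __init__(self):
        super().__init__()
        self.events = 0

    def handle_starttag(self, tag, attrs):
        self.events += 1

    def handle_endtag(self, tag):
        self.events += 1


def trace_buffer(lines):
    """Feed lines one at a time and record the parser state after each.

    Returns a list of (line number, tag events so far, buffered characters).
    """
    parser = CountingParser()
    trace = []
    for num, line in enumerate(lines, start=1):
        parser.feed(line)
        # rawdata is an internal attribute (an assumption of this sketch):
        # it holds the input the parser has received but not yet consumed.
        trace.append((num, parser.events, len(parser.rawdata)))
    return trace


if __name__ == '__main__':
    # Made-up sample: the <a> tag is split across two lines, so the parser
    # has to hold the first half until the rest arrives.
    sample = [
        '<p>one</p>\n',
        '<a href="x"\n',
        ' class="y">link</a>\n',
    ]
    for num, events, buffered in trace_buffer(sample):
        print('line {}: {} tag events, {} chars buffered'.format(num, events, buffered))
```

If the buffered count starts growing at some line and never drops back while the event count stays flat, the parser is probably still waiting for the end of a construct that began at that line. Calling parser.close() after the loop asks the parser to process whatever is still buffered as if the input had ended.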