On Oct 24, 4:38 pm, josh logan <dear.jay.lo...@gmail.com> wrote:
> On Oct 24, 4:36 pm, josh logan <dear.jay.lo...@gmail.com> wrote:
> > Hello,
> >
> > I wanted to use python to scrub an html file for score data, but I'm
> > having trouble.
> > I'm using HTMLParser, and the parsing seems to fizzle out around line
> > 192 or so. None of the event functions are being called anymore
> > (handle_starttag, handle_endtag, etc.) and I don't understand why,
> > because it is an html page over 1000 lines.
> >
> > Could someone tell me if this is a bug or simply a misunderstanding on
> > how HTMLParser works? I'd really appreciate some help in
> > understanding.
> >
> > I am using Python 3.1.2 on Windows 7 (hopefully shouldn't matter).
> >
> > I put the HTML file on pastebin, because I couldn't think of anywhere
> > better to put it: http://pastebin.com/wu6Pky2W
> >
> > The source code has been pared down to the simplest form to exhibit
> > the problem. It is displayed below, and is also on pastebin for
> > download (http://pastebin.com/HxwRTqrr):
> >
> > import sys
> > import re
> > import os.path
> > import itertools as it
> > import urllib.request
> > from html.parser import HTMLParser
> > import operator as op
> >
> > base_url = 'http://www.dci.org'
> >
> > class TestParser(HTMLParser):
> >
> >     def handle_starttag(self, tag, attrs):
> >         print('position {}, starting tag {} with attrs {}'.format(
> >             self.getpos(), tag, attrs))
> >
> >     def handle_endtag(self, tag):
> >         print('ending tag {}'.format(tag))
> >
> > def do_parsing_from_file_stream(fname):
> >     parser = TestParser()
> >
> >     with open(fname) as f:
> >         for num, line in enumerate(f, start=1):
> >             # print('Sending line {} through parser'.format(num))
> >             parser.feed(line)
> >
> > if __name__ == '__main__':
> >     do_parsing_from_file_stream(sys.argv[1])
>
> Sorry, the group doesn't like how i surrounded the Python code's
> pastebin URL with parentheses:
>
> http://pastebin.com/HxwRTqrr

I found the error. The HTML file I'm parsing has invalid HTML at line 193. It has something like:

    <a href="mystuff "class = "stuff">

Note there is no space between the closing quote of the "href" attribute value and the "class" attribute that follows it. I guess I'll go through each file and correct these issues as I parse them.

Thanks for reading, anyway.
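
For anyone who hits the same thing, here is a rough sketch of how the "correct these issues as I parse them" step might look. It is only a heuristic, not a proper HTML repair: the regex assumes double-quoted attribute values and will also touch any ordinary text or script content that happens to look like `="value"word`. The names `MISSING_SPACE`, `repair_missing_attr_space`, `TagCollector` and `parse_repaired` are made up for this example. `TagCollector` stands in for `TestParser` and records start tags instead of printing them.

```python
import re
from html.parser import HTMLParser

# Heuristic (an assumption, not part of HTMLParser): a double-quoted attribute
# value that is immediately followed by a letter is treated as a missing space,
# e.g.  href="mystuff "class   ->   href="mystuff " class
MISSING_SPACE = re.compile(r'(=\s*"[^"<>]*")(?=[A-Za-z])')


def repair_missing_attr_space(markup):
    """Insert a space where a quoted attribute value runs into the next attribute."""
    return MISSING_SPACE.sub(r'\1 ', markup)


class TagCollector(HTMLParser):
    """Hypothetical stand-in for TestParser: collects start tags and their attributes."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append((tag, dict(attrs)))


def parse_repaired(markup):
    """Repair the markup, feed it to the parser, and return the collected start tags."""
    parser = TagCollector()
    parser.feed(repair_missing_attr_space(markup))
    parser.close()
    return parser.start_tags
```

To use it on a file, read the whole file into a string and pass that to `parse_repaired`, rather than repairing line by line, since a tag whose attributes are split across two lines would otherwise slip past the regex. As far as I know, later Python 3 releases also made HTMLParser more tolerant of malformed markup, so upgrading may be enough on its own.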