New submission from Chiyuan Zhang <plus...@gmail.com>: Hi all,
I'm using BeautifulSoup to parsing an HTML page and find it refused to parse the page. By looking at the backtrace, I found it is a problem with the python built-in HTMLParser.py. In fact, the web page I'm parsing is with some Chinese characters. there is a tag like <img src=/foo/bar.png alt=中文> , note this is legacy html page where the attributes are not quoted. However, the regexp defined in HTMLParser.py is : attrfind = re.compile( r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*' r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_...@]*))?') Note that the Chinese character (also any other non-english characters), so it fire an error parsing this. I'm not sure whether the HTML standard allow un-quoted non-ASCII characters in the attributes. If it allows, this seems to be a bug. and the regexp to better be [^>\s] IMHO. BTW: It seems something like : <script> var st = "<a></"; </script> can not be parsed. :-/ ---------- components: Library (Lib) messages: 95162 nosy: pluskid severity: normal status: open title: Bug on regexp of HTMLParser type: behavior versions: Python 2.6 _______________________________________ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue7311> _______________________________________ _______________________________________________ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com