Someone should contact the Django folks. Alex Gaynor?

On Thursday, April 26, 2012, Ezio Melotti wrote:
> Hi,
>
> On 26/04/2012 22.10, Vinay Sajip wrote:
>
>> Following recent changes in html.parser, the Python 3 port of Django I'm
>> working on has started failing while parsing HTML.
>>
>> The reason appears to be that Django uses some module-level data in
>> html.parser, for example tagfind, which is a regular expression pattern.
>> This has changed recently (Ezio changed it in ba4baaddac8d).
>
> html.parser doesn't use any private _name, so I was considering only the
> documented names to be part of the public API. Several methods are marked
> with an "# internal" comment, but that's not visible unless you go read
> the source code.
>
>> Now tagfind (and other such patterns) are not marked as private (though
>> not documented), but should they be? The following script (tagfind.py):
>>
>> import html.parser as Parser
>>
>> data = '<select name="stuff">'
>>
>> m = Parser.tagfind.match(data, 1)
>> print('%r -> %r' % (Parser.tagfind.pattern, data[1:m.end()]))
>>
>> gives different results on 3.2 and 3.3:
>>
>> $ python3.2 tagfind.py
>> '[a-zA-Z][-.a-zA-Z0-9:_]*' -> 'select'
>> $ python3.3 tagfind.py
>> '([a-zA-Z][-.a-zA-Z0-9:_]*)(?:\\s|/(?!>))*' -> 'select '
>>
>> The trailing space later causes a mismatch with the end tag, and leads to
>> the errors. Django's use of the tagfind pattern is in a subclass of
>> HTMLParser, in an overridden parse_starttag method.
>
> Django shouldn't override parse_starttag (internal and undocumented), but
> just use handle_starttag (public and documented).
> I see two possible reasons why it's overriding parse_starttag:
>  1) Django is working around an HTMLParser bug. In this case the bug
>     could have been fixed (leading to the breakage of the now-useless
>     workaround), and you may now be able to use the original
>     parse_starttag and get the correct result. If it is indeed working
>     around a bug and the bug is still present, you should report it
>     upstream.
>  2) Django is implementing an additional feature. Depending on what
>     exactly the code is doing you might want to open a new feature
>     request on the bug tracker. For example the original parse_starttag
>     sets a self.lasttag attribute with the correct name of the last tag
>     parsed. Note however that both parse_starttag and self.lasttag are
>     internal and shouldn't be used directly (but lasttag could be exposed
>     and documented if people really think that it's useful).
>
>> Do we need to indicate more strongly that data like tagfind are private?
>> Or has the change introduced inadvertent breakage, requiring a fix in
>> Python?
>
> I'm not sure that reverting the regex, deprecating all the exposed
> internal names, and adding/using internal _names instead is a good idea
> at this point. This will cause more breakage, and it would require an
> extensive renaming. I can add notes to the documentation/docstrings and
> specify what's private and what's not though.
> OTOH, if this specific fix is not released yet I can still do something to
> limit/avoid the breakage.
>
> Best Regards,
> Ezio Melotti
>
>> Regards,
>>
>> Vinay Sajip
>
> _______________________________________________
> Python-Dev mailing list
> Python-Dev@python.org
> http://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe: http://mail.python.org/mailman/options/python-dev/guido%40python.org
>

-- 
--Guido van Rossum (python.org/~guido)
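
For reference, Ezio's suggestion in the quoted message (rely on the documented handle_starttag hook rather than overriding parse_starttag or reading tagfind) could look roughly like the sketch below. This is only an illustration under assumptions: TagRecorder and its attributes open_tags, last_tag and mismatches are hypothetical names invented here; they are not Django's code and not part of html.parser. It also assumes the subclass only needs tag names, which may not cover whatever Django's override actually does.

```python
from html.parser import HTMLParser


class TagRecorder(HTMLParser):
    """Hypothetical subclass that uses only documented HTMLParser hooks."""

    def __init__(self):
        super().__init__()
        self.open_tags = []    # stack of start tags not yet closed
        self.last_tag = None   # our own attribute, not the internal lasttag
        self.mismatches = []   # end tags that did not match the open tag

    def handle_starttag(self, tag, attrs):
        # The parser passes the tag name already extracted and lowercased,
        # so no module-level regex such as tagfind is needed here.
        self.last_tag = tag
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Compare against the innermost open tag recorded above.
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.mismatches.append(tag)


parser = TagRecorder()
parser.feed('<select name="stuff"><option>a</option></select>')
parser.close()
print(parser.last_tag, parser.open_tags, parser.mismatches)
```

Because the tag name arrives through the public callback, a change to the internal regular expressions should not leak whitespace into the recorded names.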