Hi,

On 26/04/2012 22.10, Vinay Sajip wrote:
Following recent changes in html.parser, the Python 3 port of Django I'm working
on has started failing while parsing HTML.

The reason appears to be that Django uses some module-level data in html.parser,
for example tagfind, which is a regular expression pattern. This has changed
recently (Ezio changed it in ba4baaddac8d).

html.parser doesn't use any private _names, so I was considering only the documented names to be part of the public API. Several methods are marked with an "# internal" comment, but that's not visible unless you read the source code.

Now tagfind (and other such patterns) are not marked as private (though not
documented), but should they be? The following script (tagfind.py):

     import html.parser as Parser

     data = '<select name="stuff">'

     m = Parser.tagfind.match(data, 1)
     print('%r ->  %r' % (Parser.tagfind.pattern, data[1:m.end()]))

gives different results on 3.2 and 3.3:

     $ python3.2 tagfind.py
     '[a-zA-Z][-.a-zA-Z0-9:_]*' ->  'select'
     $ python3.3 tagfind.py
     '([a-zA-Z][-.a-zA-Z0-9:_]*)(?:\\s|/(?!>))*' ->  'select '

The trailing space later causes a mismatch with the end tag, and leads to the
errors. Django's use of the tagfind pattern is in a subclass of HTMLParser, in
an overridden parse_starttag method.
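
As a rough illustration of the difference, the two patterns quoted above can be compiled directly with re, without touching html.parser at all. The variable names here are made up for the example; only the pattern strings come from the output above. With the 3.3 pattern, slicing up to m.end() includes the trailing whitespace, while group(1) still holds the bare tag name:

```python
import re

# Patterns copied from the 3.2 and 3.3 output quoted above.
# The variable names are illustrative, not html.parser attributes.
old_tagfind = re.compile(r'[a-zA-Z][-.a-zA-Z0-9:_]*')
new_tagfind = re.compile(r'([a-zA-Z][-.a-zA-Z0-9:_]*)(?:\s|/(?!>))*')

data = '<select name="stuff">'

# Match starting at index 1, i.e. just after the opening '<'.
old_match = old_tagfind.match(data, 1)
new_match = new_tagfind.match(data, 1)

# Slicing up to m.end() works for the old pattern only.
old_name = data[1:old_match.end()]
new_name_sliced = data[1:new_match.end()]

# The new pattern captures the tag name itself in group 1.
new_name_group = new_match.group(1)

print(repr(old_name), repr(new_name_sliced), repr(new_name_group))
```

So code that slices on m.end() picks up the extra whitespace, whereas code that reads group(1) does not. Note that group(1) doesn't exist for the 3.2 pattern, so this is not a drop-in fix across both versions.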

Django shouldn't override parse_starttag (internal and undocumented), but should just use handle_starttag (public and documented).
I see two possible reasons why it's overriding parse_starttag:
1) Django is working around an HTMLParser bug. In this case the bug might have been fixed (breaking the now-useless workaround), and you might now be able to use the original parse_starttag and get the correct result. If it is indeed working around a bug and the bug is still present, you should report it upstream.
2) Django is implementing an additional feature. Depending on what exactly the code is doing, you might want to open a new feature request on the bug tracker. For example, the original parse_starttag sets a self.lasttag attribute with the correct name of the last tag parsed. Note however that both parse_starttag and self.lasttag are internal and shouldn't be used directly (but lasttag could be exposed and documented if people really think that it's useful).
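
For reference, a minimal sketch of a subclass that relies only on the documented handle_starttag/handle_endtag hooks. The class name and the two list attributes are hypothetical, and this is not what Django's code does; it only shows that the tag name arrives already parsed, so there is no need to touch tagfind:

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical example: record tags using only the documented hooks."""

    def __init__(self):
        super().__init__()
        self.start_tags = []  # (tag, attrs) pairs, in document order
        self.end_tags = []    # tag names, in document order

    def handle_starttag(self, tag, attrs):
        # 'tag' is the lowercased tag name; 'attrs' is a list of
        # (name, value) pairs.
        self.start_tags.append((tag, attrs))

    def handle_endtag(self, tag):
        self.end_tags.append(tag)


parser = TagCollector()
parser.feed('<select name="stuff"><option>a</option></select>')
parser.close()

print(parser.start_tags)
print(parser.end_tags)
```

Since the parser passes the same tag name to both hooks, matching start and end tags in the subclass doesn't depend on how the internal regex treats trailing whitespace.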

Do we need to indicate more strongly that data like tagfind are private? Or has
the change introduced inadvertent breakage, requiring a fix in Python?

I'm not sure that reverting the regex, deprecating all the exposed internal names, and adding/using internal _names instead is a good idea at this point. This would cause more breakage, and it would require extensive renaming. I can add notes to the documentation/docstrings to specify what's private and what's not, though. OTOH, if this specific fix is not released yet, I can still do something to limit/avoid the breakage.

Best Regards,
Ezio Melotti

Regards,

Vinay Sajip

