Re: python_2_unicode_compatible pitfalls

Mikhail Korobov Thu, 27 Dec 2012 12:26:14 -0800

Oh, my description of (5) is not totally correct: u'%r' % bytestring_value
is fine because repr(non_ascii_bytestring) is escaped 7-bit ASCII; this
means HttpResponseBase._convert_to_charset is almost fine (the bytes would
be in the wrong encoding, but this won't raise an exception). The argument
about "%r" % object_with_non_ascii__repr__ still applies.
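
The difference between the two cases can be sketched as follows. This is only a model that runs under Python 3: the implicit decoding step that Python 2 performs for u'%r' % obj is written out by hand, and simulate_py2_unicode_format, is_ascii and the sample "<Tag: ...>" repr are hypothetical names made up for illustration, not anything from Django.

```python
# Runs under Python 3; the Python 2 implicit decoding step is simulated explicitly.

def is_ascii(text):
    """Return True if every character of text is 7-bit ASCII."""
    return all(ord(ch) < 128 for ch in text)

# repr() of a non-ASCII bytestring escapes every high byte as \x..
# (Python 3 also adds a b'' prefix, Python 2 does not).
non_ascii_bytes = "привет".encode("utf-8")
bytes_repr = repr(non_ascii_bytes)

def simulate_py2_unicode_format(repr_bytes, default_encoding="ascii"):
    """Hypothetical helper: decode the bytes returned by __repr__ the way
    Python 2 would when the format string is unicode."""
    return repr_bytes.decode(default_encoding)

# Case 1: the repr of a bytestring is plain ASCII, so decoding succeeds.
safe = simulate_py2_unicode_format(bytes_repr.encode("ascii"))

# Case 2: a custom __repr__ that returned non-ASCII bytes fails to decode.
try:
    simulate_py2_unicode_format("<Tag: привет>".encode("utf-8"))
    raised = False
except UnicodeDecodeError:
    raised = True
```

Under these assumptions only the second case raises, which matches the correction above: bytestring arguments are safe for %r, objects with a non-ASCII __repr__ are not.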


On Friday, 28 December 2012, 1:20:24 UTC+6, Mikhail Korobov wrote:
>
> Hi there,
>
> First of all, many kudos for the Python 3.x support in the upcoming
> Django 1.5, and for the way it is handled (the approach, the docs, etc.)!
>
> I think there are some pitfalls with the
> @python_2_unicode_compatible decorator as it is currently implemented in
> Django (and with __str__/__repr__ in general), and I want to share my
> thoughts before the 1.5 release. I'm sorry that this message is pretty
> vague; it points to some problems with the current approach (some of them
> are real, some would occur very rarely) but it doesn't propose a solution
> for Django other than "please review the code once more".
>
> 1) @python_2_unicode_compatible doesn't handle __repr__.
>    For example, this affects django.db.models.options.Options,
>    django.core.files.base.File (and ContentFile),
>    django.contrib.admin.models.LogEntry, django.template.base.Variable
>    and probably many others (their __repr__ incorrectly returns unicode).
>
>    It may also be the reason why django.db.models.Model.__repr__ doesn't
>    follow Python conventions ("__repr__ should be information-rich and
>    unambiguous" - unicode values are replaced with "[Bad Unicode data]").
>    By the way, the way Django detects whether a value needs replacing
>    is not correct and doesn't prevent all errors, because what
>    "u = six.text_type(self)" does for a bytestring is decode the data
>    using sys.getdefaultencoding(), while repr is (most?) often used in
>    the console, where sys.stdout.encoding matters.
>
> 2) Under Python 2.x, __str__ is implemented as __unicode__
>    encoded to UTF-8. This breaks 'print django_obj' when
>    sys.stdout.encoding is not UTF-8, because print uses __str__ (not
>    __unicode__) for custom objects, and the terminal expects the result
>    to be encoded in sys.stdout.encoding (print encodes unicode strings to
>    sys.stdout.encoding, but doesn't use __unicode__ of objects; this is
>    hard-coded in Python 2.x). This may affect the REPL in Windows
>    consoles and printing/writing to stdout in management commands.
>
> 3) @python_2_unicode_compatible produces incorrect results
>    when applied twice (__str__ is patched by the previous decorator
>    application and returns a bytestring because of that).
>    This is easy to overlook, e.g. when applying this decorator to a
>    subclass of a class which is wrapped with @python_2_unicode_compatible
>    and deleting the overridden __str__ afterwards.
>
> 4) __str__ is not always properly implemented for this decorator in Django
>    code. To work properly with @python_2_unicode_compatible,
>    __str__ must return a unicode string. This is quite subtle.
>    For example, take a look at django.contrib.gis.maps.google.GEvent.
>    __str__ is implemented as
>    "return mark_safe('"%s", %s' %(self.event, self.action))",
>    but "from __future__ import unicode_literals" is not applied to the
>    file. This means that if event and action are Python objects with both
>    __str__ and __unicode__ methods defined (e.g. objects of a class
>    wrapped with python_2_unicode_compatible), then __str__ would be called
>    for these objects, not __unicode__ (because the format string is a
>    bytestring). Generally, "%s" % something is a good and correct pattern
>    for a __str__ implementation (it does the right thing under both
>    Python 2.x and 3.x when the unicode_literals future import is there),
>    but it is incorrect under Python 2.x if unicode_literals is not
>    imported.
>
> 5) %r is very tricky. If unicode_literals is in effect, or some
>    arguments for string formatting are unicode,
>    "%r" % obj would trigger bytes decoding using sys.getdefaultencoding()
>    under Python 2.x (unless obj is a unicode string), and if obj.__repr__
>    returns non-ASCII text or obj is a bytestring, an exception would be
>    raised (because sys.getdefaultencoding() is usually ascii).
>    This format specifier is used, for example, in default_error_messages
>    for django.db.models.fields.Field; after switching to unicode_literals
>    this may start raising UnicodeDecodeError for non-ASCII choices
>    if they are custom objects (not unicode strings).
>    Another example is
>    django.http.response.HttpResponseBase._convert_to_charset,
>    where the BadHeaderError exception is raised: after switching to
>    unicode_literals, the %r format specifier starts triggering decoding
>    of "value" using sys.getdefaultencoding(), which is incorrect because
>    "value" is a bytestring in the 'charset' encoding under Python 2.x.
>    Another example is django.utils.datastructures.SortedDict:
>    its __repr__ uses '%r: %r' % (k, v) for k, v in six.iteritems(self),
>    which may fail if a key is a unicode string and a value is a bytestring
>    or an object with __repr__ returning non-ASCII text. Another example
>    is django.utils.encoding.DjangoUnicodeDecodeError
>    (it has an incorrect __str__, by the way, because it returns unicode) -
>    it uses "%r" for self.obj with a unicode format string,
>    and this would blow up if __repr__ of obj returns non-ASCII text.
>    There are other places where %r is used, and they are all fragile.
>
> I've implemented another python_2_unicode_compatible decorator
> (inspired by Django's, the idea is cool) for NLTK:
> https://github.com/nltk/nltk/blob/2and3/nltk/compat.py#L122 which
> resolves some of the issues above (it handles __repr__, limits __str__
> and __repr__ to ASCII and supports subclassing better). The article
> (rather lengthy, with some Django bashing :) that provides the motivation
> for the decorator used in NLTK:
> http://kmike.ru/python-with-strings-attached/ (the code in the article is
> a bit outdated, it is not the code used in NLTK; the NLTK version was
> improved, but I haven't updated the article yet).
>
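
The encoding mismatch described in point (2) of the quoted message can be modelled roughly as below. The sketch runs under Python 3 and decodes bytes by hand to stand in for what a terminal does; py2_style_str and terminal_shows are hypothetical helpers, and cp866 is just one assumed example of a non-UTF-8 console encoding.

```python
# Runs under Python 3; bytes are decoded by hand to model what a terminal does.

text = "привет"  # stands in for what __unicode__ would return

def py2_style_str(value):
    """Hypothetical model of the patched __str__: the text as UTF-8 bytes."""
    return value.encode("utf-8")

def terminal_shows(raw_bytes, terminal_encoding):
    """Hypothetical model of a terminal interpreting bytes in its own encoding."""
    return raw_bytes.decode(terminal_encoding, "replace")

# A UTF-8 terminal displays the UTF-8 bytes correctly.
utf8_terminal = terminal_shows(py2_style_str(text), "utf-8")

# A cp866 terminal receives the same UTF-8 bytes and shows mojibake.
cp866_terminal = terminal_shows(py2_style_str(text), "cp866")

# Encoding to the terminal's own encoding (what print does for unicode
# strings in Python 2) displays correctly.
encoded_for_terminal = terminal_shows(text.encode("cp866"), "cp866")
```

In this model the text survives only when the bytes are produced in the terminal's encoding, which is the step that a __str__ hard-wired to UTF-8 skips.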

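One way to keep __repr__ within 7-bit ASCII, in the spirit of the "limits __str__ and __repr__ to ASCII" idea mentioned in the quoted message, is to escape non-ASCII characters. The sketch below is an assumption for illustration only: it is not the NLTK code, and ascii_only and Tag are hypothetical names. It runs under Python 3.

```python
def ascii_only(text):
    """Hypothetical helper: escape non-ASCII characters as \\uXXXX sequences."""
    return text.encode("ascii", "backslashreplace").decode("ascii")

class Tag(object):
    """Hypothetical class whose repr contains user-supplied text."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        # Escaping keeps the repr safe to print or to interpolate with %r
        # regardless of the console or default encoding.
        return ascii_only("<Tag: %s>" % self.name)

tag_repr = repr(Tag("привет"))
```

The escaped form is less readable than the original text, but it is unambiguous and cannot trigger an encoding error when it is mixed with unicode format strings.
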