#30481: force_text() allows lone surrogates
-------------------------------------+-------------------------------------
               Reporter:  Adam       |          Owner:  nobody
  Hooper                             |
                   Type:             |         Status:  new
  Uncategorized                      |
              Component:  Utilities  |        Version:  2.2
               Severity:  Normal     |       Keywords:  force_text unicode
           Triage Stage:             |      Has patch:  0
  Unreviewed                         |
    Needs documentation:  0          |    Needs tests:  0
Patch needs improvement:  0          |  Easy pickings:  0
                  UI/UX:  0          |
-------------------------------------+-------------------------------------
 {{{
 $ python3
 Python 3.7.3 (default, Mar 27 2019, 13:36:35)
 [GCC 9.0.1 20190227 (Red Hat 9.0.1-0.8)] on linux
 Type "help", "copyright", "credits" or "license" for more information.

 >>> invalid_text = '\ud802\udf12'
 >>> print(invalid_text)  # we'd expect this to fail
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1:
 surrogates not allowed

 >>> import django.utils.encoding
 >>> django.VERSION
 (2, 2, 0, 'alpha', 1)

 >>> valid_text = django.utils.encoding.force_text(invalid_text)
 >>> print(valid_text)  # we'd expect this to succeed?
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1:
 surrogates not allowed

 >>> valid_text
 '\ud802\udf12'
 }}}

 Perhaps this is a flaw in my expectations? I'd expect `force_text()`'s
 output to always be valid text, even though Python allows me to create
 _non-text_ `str` objects. (In this case, I'd expect something like
 `\ufffd\ufffd`: two Unicode replacement characters.)

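 Unicode primer: `\ud802` and `\udf12` are each a "lone surrogate" in this
 context. A Python `str` is a sequence of code points, not UTF-16 code
 units, so the two are never combined into a surrogate pair. A lone
 surrogate is a valid Unicode _code point_ but not a Unicode scalar value,
 so it does not represent _text_ and cannot be encoded as UTF-8. (Lone
 surrogates can crop up if someone decodes valid UCS-2 as UTF-16.) I don't
 think any caller of `force_text()` expects it to ever return a non-textual
 Unicode string.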
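
 To make the expectation concrete, here is a minimal sketch of the
 behaviour I have in mind. The helper names `replace_lone_surrogates()` and
 `is_encodable_text()` are hypothetical, made up for illustration; they are
 not part of Django, and this is not a proposed patch.

```python
def replace_lone_surrogates(text, replacement="\ufffd"):
    """Return `text` with every surrogate code point replaced.

    Hypothetical helper: any code point in U+D800..U+DFFF is swapped for
    `replacement` (U+FFFD, the Unicode replacement character, by default).
    """
    return "".join(
        replacement if "\ud800" <= ch <= "\udfff" else ch
        for ch in text
    )


def is_encodable_text(text):
    """Return True if `text` can be encoded as strict UTF-8."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


invalid_text = "\ud802\udf12"
cleaned = replace_lone_surrogates(invalid_text)

print(is_encodable_text(invalid_text))  # False
print(is_encodable_text(cleaned))       # True
print(ascii(cleaned))                   # '\ufffd\ufffd'
```

 Replacing each surrogate separately is only one possible policy. Raising
 an exception, or combining a high/low pair into a single supplementary
 character, would be equally defensible.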

-- 
Ticket URL: <https://code.djangoproject.com/ticket/30481>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.
