#30892: Slugify and SlugField with Diacritics
--------------------------------------+------------------------
               Reporter:  originell   |          Owner:  nobody
                   Type:  Bug         |         Status:  new
              Component:  Utilities   |        Version:  2.2
               Severity:  Normal      |       Keywords:
           Triage Stage:  Unreviewed  |      Has patch:  0
    Needs documentation:  0           |    Needs tests:  0
Patch needs improvement:  0           |  Easy pickings:  0
                  UI/UX:  0           |
--------------------------------------+------------------------
 While working on an international project, we discovered that the
 Turkish/Azerbaijani letter `İ` cannot be properly processed when
 `SlugField` and `slugify` are run with `allow_unicode=True`.

 The project itself runs with Django 2.2.6 and Wagtail 2.6.2. I first
 talked about this in the Wagtail support channel and, while researching
 further, discovered that this is a Django/Python-related issue. This was
 tested on Python 3.6 and on Python 3.7.

 (quick shoutout to Matt Wescott @gasmanic of Wagtail fame for being a
 sparring partner in this)

 There is a rather detailed analysis (README) in a git repo I created:
 https://github.com/originell/django-wagtail-turkish-i. It was also the
 basis for my initial call for help in Wagtail's support channel. Meanwhile
 I have extended it with a Django-only project, to be 100% sure this
 has nothing to do with Wagtail.

 I was not able to find anything similar in trac. While I encourage whoever
 is reading this to actually read the README in the git repo, I want to
 honor your time and will try to provide a more concise version of the bug
 here.

 === Explanation

 `models.py`

 {{{#!python
 from django.db import models
 from django.utils.text import slugify


 class Page(models.Model):
     title = models.CharField(max_length=255)
     slug = models.SlugField(allow_unicode=True)

     def __str__(self):
         return self.title
 }}}

 Using this in a shell/test like a (Model)Form might:

 {{{#!python
 from django.utils.text import slugify

 page = Page(title="Hello İstanbul")
 page.slug = slugify(page.title, allow_unicode=True)
 page.full_clean()
 }}}

 `full_clean()` then raises

     django.core.exceptions.ValidationError: {'slug': ["Enter a valid 'slug' consisting of Unicode letters, numbers, underscores, or hyphens."]}

 Why is that?

 `slugify` does the following internally:

 {{{#!python
 re.sub(r'[^\w\s-]', '', value).strip().lower()
 }}}

 Thing is, Python's `.lower()` of the `İ` in `İstanbul` looks like this:

 {{{#!python
 >>> import unicodedata
 >>> [unicodedata.name(character) for character in 'İ'.lower()]
 ['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']
 }}}

 So `slugify()` itself finishes without error, but its result is then
 validated by `SlugField`, which uses `slug_unicode_re` (`^[-\w]+\Z`).
 However, the ''Combining Dot Above'' is not matched by `\w`:

 {{{#!python
 >>> import unicodedata
 >>> from django.core.validators import slug_unicode_re
 >>> [(character, unicodedata.name(character), slug_unicode_re.match(character))
 ...  for character in 'İ'.lower()]
 [
  ('i', 'LATIN SMALL LETTER I', <re.Match object; span=(0, 1), match='i'>),
  # EUREKA!!
  ('̇', 'COMBINING DOT ABOVE', None)
 ]
 }}}

 So that's why the `ValidationError` is raised.
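
 The failure can also be reproduced without a Django project. The following
 is only a sketch: it recompiles the pattern quoted above under the made-up
 name `SLUG_UNICODE_RE`, and `invalid_slug_chars` is a hypothetical helper,
 not part of Django.

```python
import re
import unicodedata

# Pattern quoted in this ticket for slug_unicode_re, recompiled here so the
# snippet runs without Django installed (the name is an assumption).
SLUG_UNICODE_RE = re.compile(r'^[-\w]+\Z')


def invalid_slug_chars(slug):
    """Hypothetical helper: list (index, Unicode name) of rejected characters."""
    return [
        (index, unicodedata.name(char, 'UNKNOWN'))
        for index, char in enumerate(slug)
        if SLUG_UNICODE_RE.match(char) is None
    ]


# Same steps as the slugify() line quoted above, applied to a single word.
# '\u0130' is the capital I with dot above.
slug = re.sub(r'[^\w\s-]', '', '\u0130stanbul').strip().lower()

print(SLUG_UNICODE_RE.match(slug))  # None, so validation would fail
print(invalid_slug_chars(slug))     # [(1, 'COMBINING DOT ABOVE')]
```

 The only rejected character is the combining mark that `.lower()`
 introduced after the `re.sub` filter had already run.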

 === Proposed Solution

 The culprit seems to be the point at which `lower()` is called in `slugify`.
 The assumption that lowercasing a `re.sub`-ed string still yields a string
 that matches `slug_unicode_re` does not seem to hold true.

 Hence, instead of doing this in `slugify()`

 {{{#!python
 re.sub(r'[^\w\s-]', '', value).strip().lower()
 }}}

 It might be better to do it like this

 {{{#!python
 re.sub(r'[^\w\s-]', '', value.lower()).strip()
 }}}
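
 To compare the two orderings side by side, here is a minimal sketch using
 only the standard library. The function names `strip_then_lower` and
 `lower_then_strip` are made up for illustration, and the pattern is the one
 quoted earlier in this ticket, not imported from Django.

```python
import re

# Pattern quoted in this ticket for slug_unicode_re (recompiled locally).
SLUG_UNICODE_RE = re.compile(r'^[-\w]+\Z')


def strip_then_lower(value):
    # Current order: filter characters first, lowercase last.
    return re.sub(r'[^\w\s-]', '', value).strip().lower()


def lower_then_strip(value):
    # Proposed order: lowercase first, so the filter sees the expanded form.
    return re.sub(r'[^\w\s-]', '', value.lower()).strip()


word = '\u0130stanbul'  # capital I with dot above, then "stanbul"
current = strip_then_lower(word)
proposed = lower_then_strip(word)

print(len(current), bool(SLUG_UNICODE_RE.match(current)))    # 9 False
print(len(proposed), bool(SLUG_UNICODE_RE.match(proposed)))  # 8 True
```

 Note that with the proposed order the combining dot is filtered out, so the
 resulting slug starts with a plain `i`.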

 === Is Python the actual culprit?

 Yeah, it might be. Matt (@gasmanic) urged me to also check whether Python
 might be doing this wrong.

 -   The `İ` is the ''Latin Capital Letter I with Dot Above''. Its
 code point is `U+0130`. According to the chart for the Latin Extended-A set
 (https://www.unicode.org/charts/PDF/U0100.pdf), its lowercase version is
 `U+0069`.
 -   `U+0069` lives in the ''C0 Controls and Basic Latin set''
 (https://www.unicode.org/charts/PDF/U0000.pdf). Lo and behold: it is the
 ''Latin small letter I''. So a Latin lowercase `i`.

 Does this really mean that Python is doing something weird here by adding
 the ''Combining dot above''? Honestly, I can't imagine that and I am
 probably missing an important thing here because my view is too naive.
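
 One possible explanation, offered as an assumption and not verified as part
 of this ticket: the code charts list the simple one-to-one case mapping,
 whereas Python's `str.lower()` applies Unicode's full case mapping
 (SpecialCasing.txt), which may expand one character into several. The
 sketch below only inspects what `.lower()` returns for `U+0130`.

```python
import unicodedata

capital_i_dot = '\u0130'  # LATIN CAPITAL LETTER I WITH DOT ABOVE
lowered = capital_i_dot.lower()

# Code points and general categories of the lowercased result.
code_points = ['U+%04X' % ord(char) for char in lowered]
categories = [unicodedata.category(char) for char in lowered]

print(code_points)  # ['U+0069', 'U+0307']
print(categories)   # ['Ll', 'Mn']
```

 The second code point is a nonspacing mark (`Mn`), not a letter, which is
 consistent with `\w` not matching it.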

 ---

 I hope this shorter explanation makes sense. If it does not, please try to
 read through the detailed analysis in the repo
 (https://github.com/originell/django-wagtail-turkish-i/blob/master/README.md).
 If that also does not make a ton of sense, let me know.

 In any case, thank you for taking the time to read this bug report.
 Looking forward to feedback and thoughts. I am happy to oblige in any
 capacity necessary.

-- 
Ticket URL: <https://code.djangoproject.com/ticket/30892>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.
