#30892: Slugify and SlugField with Diacritics
--------------------------------------+------------------------
Reporter: originell | Owner: nobody
Type: Bug | Status: new
Component: Utilities | Version: 2.2
Severity: Normal | Keywords:
Triage Stage: Unreviewed | Has patch: 0
Needs documentation: 0 | Needs tests: 0
Patch needs improvement: 0 | Easy pickings: 0
UI/UX: 0 |
--------------------------------------+------------------------
While working on an international project, we discovered that the
turkish/azerbaijani letter `İ` can not be properly processed when
`SlugField` and `slugify` are run with `allow_unicode=True`.
The project itself runs with Django 2.2.6 and Wagtail 2.6.2. I first
talked about this in the Wagtail Support Channel and while researching
further, discovered that this is a Django/Python related issue. This was
tested on Python 3.6 and on Python 3.7.
(quick shoutout to Matt Wescott @gasmanic of Wagtail Fame for being a
sparing partner in this)
There is a rather detailed analysis (README) in a git repo I created
https://github.com/originell/django-wagtail-turkish-i - it was also the
basis for my initial call for help in wagtail's support channel. Meanwhile
I have extended it with a Django-only project, as to be a 100% sure this
has nothing to do with Wagtail.
I was not able to find anything similar in trac. While I encourage whoever
is reading this to actually read the README in the git repo, I want to
honor your time and will try to provide a more concise version of the bug
here.
=== Explanation
`models.py`
{{{#!python
from django.db import models
from django.utils.text import slugify
class Page(models.Model):
title = models.CharField(max_length=255)
slug = models.SlugField(allow_unicode=True)
def __str__(self):
return self.title
}}}
Using this in a shell/test like a (Model)Form might:
{{{#!python
from django.utils.text import slugify
page = Page(title="Hello İstanbul")
page.slug = slugify(page.title, allow_unicode=True)
page.full_clean()
}}}
`full_clean()` then raises
django.core.exceptions.ValidationError: {'slug': ["Enter a valid
'slug' consisting of Unicode letters, numbers, underscores, or hyphens."]}
Why is that?
`slugify` does the following internally:
{{{#!python
re.sub(r'[^\w\s-]', '', value).strip().lower()
}}}
Thing is, Python's `.lower()` of the `İ` in `İstanbul` looks like this:
{{{#!python
>>> [unicodedata.name(character) for character in 'İ'.lower()]
['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']
}}}
So, while `slugify()` finishes, the result is then passed into `SlugField`
which uses the `slug_unicode_re` (`^[-\w]+\Z`). However, the ''Combining
Dot Above'' is not a valid `\w`:
{{{#!python
>>> [(character, unicodedata.name(character),
slug_unicode_re.match(character)) for character in 'İ'.lower()]
[
('i', 'LATIN SMALL LETTER I', <re.Match object; span=(0, 1), match='i'>),
# EUREKA!!
('̇', 'COMBINING DOT ABOVE', None)
]
}}}
So that's why the `ValidationError` is raised.
=== Proposed Solution
The culprit seems to be the order in which `lower()` is called in slugify.
The assumption that the lowercase version of a `re.sub`-ed string is still
a valid `slug_unicode_re`, does not seem to hold true.
Hence, instead of doing this in `slugify()`
{{{#!python
re.sub(r'[^\w\s-]', '', value).strip().lower()
}}}
It might be better to do it like this
{{{#!python
re.sub(r'[^\w\s-]', '', value.lower()).strip()
}}}
=== Is Python the actual culprit?
Yeah it might be. Matt (@gasmanic) urged me to also take a look if Python
might be doing this wrong.
- The `İ` is the ''Latin Capital Letter I with Dot Above''. It's
codepoint is `U+0130` According to the chart for the Latin Extended-A set
(https://www.unicode.org/charts/PDF/U0100.pdf), it's lowercase version is
`U+0069`.
- `U+0069` lives in the ''C0 Controls and Basic Latin set''
(https://www.unicode.org/charts/PDF/U0000.pdf). Lo and behold: it is the
''Latin small letter I''. So a latin lowercase `i`.
Does this really mean that Python is doing something weird here by adding
the ''Combining dot above''? Honestly, I can't imagine that and I am
probably missing an important thing here because my view is too naive.
---
I hope this shorter explanation makes sense. If it does not, please try to
read through the detailed analysis in the repo
(https://github.com/originell/django-wagtail-
turkish-i/blob/master/README.md). If that also does not make a ton of
sense, let me know.
In any case, thank you for taking the time to read this bug report.
Looking forward to feedback and thoughts. I am happy to oblige in any
capacity necessary.
--
Ticket URL: <https://code.djangoproject.com/ticket/30892>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.
--
You received this message because you are subscribed to the Google Groups
"Django updates" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/django-updates/052.5691f39139d866e6d30d72cb47f1a43b%40djangoproject.com.