#30892: slugify() doesn't return a valid slug for "İ".
-------------------------------------+-------------------------------------
Reporter: Luis Nell | Owner:
| Christoffer Sjöbergsson
Type: Bug | Status: assigned
Component: Utilities | Version: master
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Changes (by Christoffer Sjöbergsson):
* owner: nobody => Christoffer Sjöbergsson
* status: new => assigned
Comment:
I did some digging into this and found a few interesting things that I
thought I would share.
In short, my conclusion is that Python behaves as it should.
Firstly, because lowercasing `İ` in JavaScript gives the same result, and
secondly because
[https://unicode.org/Public/UNIDATA/SpecialCasing.txt], which describes
casing rules for certain special cases, shows that the lowercase mapping
of `İ` is indeed `['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']`.
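The mapping can be checked from Python directly. This is only a small sketch using the standard `unicodedata` module; the variable names are mine.

```python
import unicodedata

# U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
capital_dotted_i = "\u0130"
lowered = capital_dotted_i.lower()

# Name every code point in the lowercased result.
names = [unicodedata.name(ch) for ch in lowered]
print(len(lowered), names)
# prints: 2 ['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']
```

So one input character becomes two code points, and the second one is a combining mark rather than a letter.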
Things are however a little more complicated than that, as it turns out
that the casing operation is performed differently depending on which
locale is used. Since locale settings should not break code, I will not go
into it much here, but for further reading take a look at JavaScript's
`toLocaleLowerCase` or at this Stack Overflow answer
[https://stackoverflow.com/a/19031612]. If the locale 'TR' is used in
these examples, the lowercase version of `İ` is just `LATIN SMALL
LETTER I`.
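Python's `str.lower()` takes no locale argument, so the Turkish behaviour would have to be emulated by hand. A minimal sketch of what that could look like follows; `turkic_lower` is a hypothetical helper of mine, not something Python or Django provides, and it only covers the dotted/dotless i pair.

```python
def turkic_lower(text):
    """Hypothetical helper: lowercase using the Turkish dotted/dotless i rules."""
    # Apply the two locale-specific mappings first, then the default rules.
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


default_result = "\u0130".lower()       # 'i' followed by COMBINING DOT ABOVE
turkic_result = turkic_lower("\u0130")  # plain 'i'
print(len(default_result), len(turkic_result))
```

This is only meant to show why the two results differ; it is not a suggestion for `slugify()`.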
Now to the possible solution:
Replying to [comment:2 Luis Nell]:
> ...I also thought about looping over the characters and simply removing
> everything that does not match the builtin slug validation regular
> expressions...
As far as I understand it, this is mostly what happens when `lower()` is
moved the way you suggest.
`re.sub(r'[^\w\s-]', '', value)` removes every character that is not a
word character, whitespace, or a hyphen, which is almost the same regexp
that the slug is later validated against. As previously discovered, the
problem arises when the `lower()` operation then adds new symbols that the
regexp does not allow. I would therefore argue that moving `lower()` is a
decent solution, because the generated slug will validate as long as the
validation regexp stays the same as now. I would however make the case for
moving the `lower()` operation to a different place, since the Unicode
documentation
[https://www.unicode.org/versions/Unicode12.0.0/UnicodeStandard-12.0.pdf]
states that normalization is not preserved by casing operations:
> Casing operations as defined in Section 3.13, Default Case Algorithms
> are not guaranteed to preserve Normalization Forms. That is, some
> strings in a particular Normalization Form (for example, NFC) will no
> longer be in that form after the casing operation is performed.
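To make the problem with lowercasing last concrete, here is a rough sketch. `slugify_lower_last` is a hypothetical stand-in for an ordering where `lower()` runs after the character filter, and `SLUG_UNICODE_RE` is my assumption of what the Unicode slug validation pattern looks like; neither is copied from Django.

```python
import re
import unicodedata

# Assumed stand-in for the Unicode slug validation pattern.
SLUG_UNICODE_RE = re.compile(r"^[-\w]+\Z")


def slugify_lower_last(value):
    """Sketch: lowercase only after the character filter has already run."""
    value = unicodedata.normalize("NFKC", str(value))
    value = re.sub(r"[^\w\s-]", "", value).strip().lower()
    return re.sub(r"[-\s]+", "-", value)


slug = slugify_lower_last("\u0130")
print([unicodedata.name(ch) for ch in slug])
print(SLUG_UNICODE_RE.match(slug) is not None)  # prints: False
```

The filter lets `İ` through because it is a word character, and the combining dot only appears afterwards, when nothing is left to remove it.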
Therefore I would argue that it is better to perform the `lower()`
operation before the normalization, as follows:
{{{#!python
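import re
import unicodedata

def slugify(value, allow_unicode=False):
    # Lowercase first, so whatever lower() adds is still normalized and filtered.
    value = str(value).lower()
    if allow_unicode:
        value = unicodedata.normalize('NFKC', value)
    else:
        value = unicodedata.normalize('NFKD', value).encode(
            'ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s-]', '', value).strip()
    return re.sub(r'[-\s]+', '-', value)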
}}}
This way the string is lowercased first, then normalized to keep as many
of the special characters as possible, and finally the remaining symbols
are removed by the regexp substitution.
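The three stages can be followed step by step with a small tracing sketch. `trace_lower_first` is a hypothetical helper of mine that repeats the lower-first ordering and records the intermediate values; the sample input is only an illustration.

```python
import re
import unicodedata


def trace_lower_first(value, allow_unicode=True):
    """Hypothetical helper: return the value after each lower-first step."""
    steps = {}
    value = str(value).lower()
    steps["lower"] = value
    if allow_unicode:
        value = unicodedata.normalize("NFKC", value)
    else:
        value = unicodedata.normalize("NFKD", value).encode(
            "ascii", "ignore").decode("ascii")
    steps["normalize"] = value
    # Drop everything that is not a word character, whitespace, or a hyphen.
    value = re.sub(r"[^\w\s-]", "", value).strip()
    steps["filter"] = value
    steps["slug"] = re.sub(r"[-\s]+", "-", value)
    return steps


sample = "\u0130stanbul \u015eehir"
unicode_steps = trace_lower_first(sample)
ascii_steps = trace_lower_first(sample, allow_unicode=False)
for step, text in unicode_steps.items():
    print(step, ascii(text))
```

With `allow_unicode=True` the combining dot survives normalization and is removed by the filter; with `allow_unicode=False` it is already dropped by the ASCII encoding step.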
I guess this could in theory change the meaning of some words
unintentionally, but I don't know whether it is feasible to do this kind
of string manipulation while guaranteeing that the meaning is preserved.
I have started to prepare a patch with my proposed change as well as
tests, so I have assigned this issue to myself for now. My intention is to
submit a patch later this week that a few others could test to check for
further issues. Luis Nell, if you would rather write the patch yourself
and feel that I overstepped by assigning myself, just claim the issue for
yourself, no hard feelings on my part :)
--
Ticket URL: <https://code.djangoproject.com/ticket/30892#comment:3>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.