Re: [Django] #30892: slugify() doesn't return a valid slug for "İ".

Django Tue, 10 Dec 2019 06:20:54 -0800

#30892: slugify() doesn't return a valid slug for "İ".
-------------------------------------+-------------------------------------
     Reporter:  Luis Nell            |                    Owner:
                                     |  Christoffer Sjöbergsson
         Type:  Bug                  |                   Status:  assigned
    Component:  Utilities            |                  Version:  master
     Severity:  Normal               |               Resolution:
     Keywords:                       |             Triage Stage:  Accepted
    Has patch:  0                    |      Needs documentation:  0
  Needs tests:  0                    |  Patch needs improvement:  0
Easy pickings:  0                    |                    UI/UX:  0
-------------------------------------+-------------------------------------
Changes (by Christoffer Sjöbergsson):


 * owner:  nobody => Christoffer Sjöbergsson
 * status:  new => assigned


Comment:

 I did some digging into this and found some interesting things that I
 thought I would share.

 In short, my conclusion is that Python behaves as it should.

 First, lowercasing `İ` in JavaScript gives the same result. Second,
 [https://unicode.org/Public/UNIDATA/SpecialCasing.txt], which describes
 the casing rules for special cases, shows that the lowercase mapping of
 `İ` is indeed `['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']`.
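
 As a quick check, the mapping can be inspected from Python with nothing
 but the standard library. This is only a small sketch; the variable names
 are mine and not taken from Django.

```python
import unicodedata

# U+0130 is LATIN CAPITAL LETTER I WITH DOT ABOVE ("İ").
lowered = "\u0130".lower()

# One input character becomes two code points after lower().
names = [unicodedata.name(ch) for ch in lowered]
print(len(lowered))  # 2
print(names)         # ['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']
```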

 Things are, however, a little more complicated than that: the casing
 operation is performed differently depending on which locale is used.
 Since locale settings should not break code, I will not go into much
 detail here, but for further reading take a look at JavaScript's
 `toLocaleLowerCase` or at this Stack Overflow answer
 [https://stackoverflow.com/a/19031612]. If the locale 'TR' is used in
 these examples, the lowercase version of `İ` is just `LATIN SMALL
 LETTER I`.
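
 Python's `str.lower()` takes no locale argument, so the Turkish behaviour
 has to be emulated by hand. The helper below is hypothetical (the name
 `lower_turkish` is my own) and only covers the two dotted/dotless I
 mappings; it is meant to illustrate the difference, not to be a complete
 locale-aware implementation.

```python
def lower_turkish(text):
    """Hypothetical helper: lowercase using the Turkish rules for I.

    Under Turkish casing, "İ" (U+0130) lowercases to a plain "i" and
    "I" lowercases to the dotless "ı" (U+0131). Everything else is left
    to the default str.lower().
    """
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


print(lower_turkish("\u0130"))        # i
print(len("\u0130".lower()))          # 2 (default, locale-independent result)
print(len(lower_turkish("\u0130")))   # 1
```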


 Now to the possible solution:

 Replying to [comment:2 Luis Nell]:
 > ...I also thought about looping over the characters and simply removing
 > everything that does not match the builtin slug validation regular
 > expressions...

 As far as I understand it, this is mostly what happens when the placement
 of `lower()` is changed the way you suggest.
 `re.sub(r'[^\w\s-]', '', value)` removes every character that is not a
 Unicode word character, whitespace, or a hyphen, which is almost the same
 regexp as the one the slug is then validated against. As previously
 discovered, the problem is that the `lower()` operation then adds new
 symbols that are not allowed by the regexp. I would therefore argue that
 moving `lower()` is a decent solution, because it makes the generated slug
 validate as long as the validation regexp stays the same as now. I would,
 however, make the case for moving the `lower()` operation to a different
 place, since the Unicode documentation
 [https://www.unicode.org/versions/Unicode12.0.0/UnicodeStandard-12.0.pdf]
 states that casing operations are not guaranteed to preserve
 normalization:

     Casing operations as defined in Section 3.13, Default Case Algorithms
     are not guaranteed to preserve Normalization Forms. That is, some
     strings in a particular Normalization Form (for example, NFC) will no
     longer be in that form after the casing operation is performed.
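
 To make the quoted statement concrete, here is a small sketch using only
 `unicodedata`. The input string is an example I picked for illustration
 (`ǰ` followed by a combining dot below), and it uses `upper()` rather than
 `lower()`; it is not one of the strings discussed in this ticket.

```python
import unicodedata


def is_nfc(text):
    # A string is in NFC if normalizing it to NFC changes nothing.
    return unicodedata.normalize("NFC", text) == text


# U+01F0 (j with caron) followed by U+0323 (combining dot below).
original = "\u01f0\u0323"

# There is no precomposed capital J with caron, so upper() expands
# U+01F0 to "J" + U+030C, which leaves the combining marks out of
# canonical order.
uppercased = original.upper()

print(is_nfc(original))    # True
print(is_nfc(uppercased))  # False
```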

 Therefore I would argue that it would be better to place the `lower()`
 operation before the normalization, as follows:
   {{{#!python
 value = str(value).lower()
 if allow_unicode:
     value = unicodedata.normalize('NFKC', value)
 else:
     value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
 value = re.sub(r'[^\w\s-]', '', value).strip()
 return re.sub(r'[-\s]+', '-', value)
 }}}
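
 For anyone who wants to try this outside of Django, here is a
 self-contained sketch that wraps the snippet above in a function and
 compares it with the ordering that lowercases last. The function names,
 the "lower last" variant and the `SLUG_RE` pattern are my own stand-ins
 based on what is described in this ticket, not copies of Django's code.

```python
import re
import unicodedata

# Assumed stand-in for the Unicode slug validation regexp.
SLUG_RE = re.compile(r"^[-\w]+\Z")


def _normalize(value, allow_unicode):
    if allow_unicode:
        return unicodedata.normalize("NFKC", value)
    return unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")


def slugify_lower_first(value, allow_unicode=False):
    # Proposed order: lowercase, normalize, then strip disallowed symbols.
    value = _normalize(str(value).lower(), allow_unicode)
    value = re.sub(r"[^\w\s-]", "", value).strip()
    return re.sub(r"[-\s]+", "-", value)


def slugify_lower_last(value, allow_unicode=False):
    # Order described in this ticket: lower() runs after the stripping,
    # so symbols it introduces are never removed.
    value = _normalize(str(value), allow_unicode)
    value = re.sub(r"[^\w\s-]", "", value).strip().lower()
    return re.sub(r"[-\s]+", "-", value)


title = "Hello \u0130stanbul"
print(bool(SLUG_RE.match(slugify_lower_first(title, allow_unicode=True))))  # True
print(bool(SLUG_RE.match(slugify_lower_last(title, allow_unicode=True))))   # False
```

 With the lowercase step first, the combining dot produced by `lower()` is
 still present when the regexp substitution runs, so it gets removed along
 with the other disallowed symbols.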

 This way the string is lowercased first, then normalized to keep as many
 of the special characters as possible, and finally the remaining
 disallowed symbols are removed by the regexp substitution.

 I guess this could in theory change the meaning of some words
 unintentionally, but I don't know whether it is feasible to do this kind
 of string manipulation while guaranteeing that the meaning is preserved.

 I have started to prepare a patch with my proposed change, as well as
 tests, so I have assigned this issue to myself for now. My intention is to
 submit a patch later this week so that a few others can test it and check
 for further issues. Luis Nell, if you feel that you would rather write the
 patch yourself and that I overstepped by assigning myself, just claim the
 issue for yourself; no hard feelings on my part :)

-- 
Ticket URL: <https://code.djangoproject.com/ticket/30892#comment:3>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

