Re: Removing accents from strings

Dave Dash Sun, 07 Dec 2008 01:58:17 -0800

Sorry Karen, my mistake for leaving that out. reTagnormalizer just
filters out everything that isn't alphanumeric; the full code is below.


Also, here's the error from manage.py test:

File "/restaurant/models.py", line 33, in mealadvisor.restaurant.models.normalize
Failed example:
    normalize(u' café ')
Expected:
    u'cafe'
Got:
    u'cafa'


***



import unicodedata, re

reTagnormalizer = re.compile(r'[^a-zA-Z0-9]')

reCombining = re.compile(u'[\u0300-\u036f\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f]', re.U)

def remove_diacritics(s):
    " Decomposes string, then removes combining characters "
    return reCombining.sub('', unicodedata.normalize('NFD', unicode(s)))


# tag normalizer
def normalize(tag):
    """
    >>> normalize(u'cafe')
    u'cafe'
    >>> normalize(u'caf e')
    u'cafe'
    >>> normalize(u' cafe ')
    u'cafe'

    For now this one fails; I think it's a problem with doctest, not the
    actual function.

    >>> normalize(u' café ')
    u'cafe'

    >>> normalize(u'cAFe')
    u'cafe'
    >>> normalize(u'%sss%s')
    u'ssss'
    """
    try:
        tag = remove_diacritics(tag)
    except UnicodeError:
        # unicode(s) could not decode a non-ASCII byte string; keep the tag as-is
        pass

    tag = reTagnormalizer.sub('', tag).lower()
    return tag
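
A possible explanation for the u'cafa' output, sketched here under assumptions rather than confirmed: if the UTF-8 bytes of the é in the docstring get decoded as Latin-1, é turns into the two characters Ã and ©. NFD then splits Ã into A plus a combining tilde, the tilde is stripped, © is dropped as non-alphanumeric, and lower() leaves 'cafa'. The snippet below is a Python 3 port of the functions above; the port, the upper-case constant names, and the Latin-1 misdecoding are all assumptions, not something the thread establishes.

```python
import re
import unicodedata

# Python 3 port of the helpers from the post (assumption: the original targets Python 2).
RE_TAG_NORMALIZER = re.compile(r'[^a-zA-Z0-9]')
RE_COMBINING = re.compile('[\u0300-\u036f\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f]')


def remove_diacritics(s):
    """Decompose the string (NFD), then strip combining characters."""
    return RE_COMBINING.sub('', unicodedata.normalize('NFD', s))


def normalize(tag):
    """Strip diacritics, drop non-alphanumerics, and lower-case the tag."""
    return RE_TAG_NORMALIZER.sub('', remove_diacritics(tag)).lower()


# Correctly decoded text behaves as the doctest expects.
correct = ' café '

# Hypothetical failure mode: the UTF-8 bytes are misread as Latin-1,
# so 'é' (0xC3 0xA9) becomes the two characters 'Ã' and '©'.
mojibake = correct.encode('utf-8').decode('latin-1')

print(normalize(correct))   # cafe
print(normalize(mojibake))  # cafa
```

If this is what is happening, the function itself is fine and the input reaching it from the doctest is already damaged, which would match it working from the Django shell.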

On Dec 6, 9:42 pm, "Karen Tracey" <[EMAIL PROTECTED]> wrote:
> On Sat, Dec 6, 2008 at 9:00 PM, Dave Dash <[EMAIL PROTECTED]> wrote:
>
> > Okay I think that fixes one fundamental issue... I've got a unittest,
> > however that fails for a function:
>
> > def normalize(tag):
> >    """
> >    >>> normalize(u'cafe')
> >    u'cafe'
> >    >>> normalize(u'caf e')
> >    u'cafe'
> >    >>> normalize(u' cafe ')
> >    u'cafe'
> >    >>> normalize(u' café ')
> >    u'cafe'
> >    >>> normalize(u'cAFe')
> >    u'cafe'
> >    >>> normalize(u'%sss%s')
> >    u'ssss'
> >    """
> >    try:
> >        tag = remove_diacritics(tag)
> >    except:
> >        pass
>
> >    tag = reTagnormalizer.sub('', tag).lower()
> >    return tag
>
> > It fails on the ' café' and translates it to cafa instead of cafe.
> > This is only through the unittest framework (doctest) since I can run
> > it from django shell and it works as intended.
>
> > Is this just an issue with doctest?
>
> If I cut and paste your code and take out reTagnormalizer (since you didn't
> post that) and all the tests that seem to depend on what it does vs.
> remove_diacritics, and just test:
>
>    """
>    >>> normalize(u'café')
>    u'cafe'
>    """
> plain Python doctesting it works fine, as does 'manage.py test someapp' (if
> I put the code in someapp's models.py file).
>
> So I can't recreate the error you are reporting based on what you have
> posted.  What's in reTagnormalizer?
>
> Karen
