Re: german umlaute on search querys

Karen Tracey Thu, 26 Nov 2009 07:38:53 -0800

On Thu, Nov 26, 2009 at 7:03 AM, Hinnack <henrik.gens...@googlemail.com> wrote:


> Hi Karen,
>
> thanks again for your reply.
> I use Aptana with pydev extension.
> Debugging the app shows the following for search:
> dict: {u'caption': u'f\\xfcr', u'showold': False}
>
>
That's confusing to me because, other than the extra \ (which could be an
artifact of how the debugger displays it), that looks like a correctly built
unicode object for für.

> and for qs:
> str: für
> although it seems to be &#65533; instead of ASCII 252 - but this could be,
> because I am sitting on a MAC
> while debugging.
>

Using python manage.py shell might shed more light. I fear the tool here is
assuming an incorrect bytestring encoding and getting in the way.
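
One way to take the display tool out of the picture is to look at code points
rather than rendered characters. The following is only a rough sketch, written
for Python 3 (the sessions in this message are Python 2.5), and the helper name
codepoints is made up for illustration. A correctly built für has exactly three
code points, so a longer list would point to a decoding problem upstream.

```python
def codepoints(text):
    """Return the code points of a text string as 'U+XXXX' labels."""
    return ["U+%04X" % ord(ch) for ch in text]


# A correctly built string: f, u-umlaut (U+00FC), r.
print(codepoints("f\u00fcr"))        # ['U+0066', 'U+00FC', 'U+0072']

# A string with one extra character where the umlaut should be.
print(codepoints("f\u00c3\u00bcr"))  # ['U+0066', 'U+00C3', 'U+00BC', 'U+0072']
```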

I cannot recreate anything like what you are seeing.  I have a model Thing
stored in a MySQL DB (using a utf-8 encoded table) with CharField name.
There are two instances of this Thing in the DB that contain für in the
name.  From a python manage.py shell, using Django 1.1.1:

>>> from ttt.models import Thing
>>> import django
>>> django.get_version()
'1.1.1'
>>> ufur = u'f\u00fcr'
>>> print ufur
für
>>> ufur
u'f\xfcr'
>>> ufur.encode('utf-8')
'f\xc3\xbcr'
>>> ufur.encode('iso-8859-1')
'f\xfcr'

Small u with umlaut is U+00FC. Encoded in utf-8 it takes the two bytes C3 BC;
encoded in iso-8859-1 it is the single byte FC.
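
For anyone following along on Python 3, where str is always Unicode and
bytestrings are a separate bytes type, the same byte counts can be checked with
a minimal sketch like this (the variable names are only illustrative):

```python
# Python 3 sketch; the session above was run on Python 2.5.
ufur = "f\u00fcr"  # f, u-umlaut (U+00FC), r

utf8_bytes = ufur.encode("utf-8")
latin1_bytes = ufur.encode("iso-8859-1")

print(utf8_bytes)    # b'f\xc3\xbcr'
print(latin1_bytes)  # b'f\xfcr'

# Three characters become four bytes in utf-8 but three in iso-8859-1.
print(len(ufur), len(utf8_bytes), len(latin1_bytes))  # 3 4 3
```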

Filtering with icontains, using either the unicode object or the utf-8
encoded bytestring, works properly:

>>> Thing.objects.filter(name__icontains=ufur)
[<Thing: für inserted as unicode>, <Thing: für inserted as utf8 bytestring>]
>>> Thing.objects.filter(name__icontains=ufur.encode('utf-8'))
[<Thing: für inserted as unicode>, <Thing: für inserted as utf8 bytestring>]

Attempting to filter with an iso-8859-1 encoded bytestring raises an error:

>>> Thing.objects.filter(name__icontains=ufur.encode('iso-8859-1'))
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/usr/lib/python2.5/site-packages/django/db/models/manager.py", line 129, in filter
    return self.get_query_set().filter(*args, **kwargs)
  File "/usr/lib/python2.5/site-packages/django/db/models/query.py", line 498, in filter
    return self._filter_or_exclude(False, *args, **kwargs)
  File "/usr/lib/python2.5/site-packages/django/db/models/query.py", line 516, in _filter_or_exclude
    clone.query.add_q(Q(*args, **kwargs))
  File "/usr/lib/python2.5/site-packages/django/db/models/sql/query.py", line 1675, in add_q
    can_reuse=used_aliases)
  File "/usr/lib/python2.5/site-packages/django/db/models/sql/query.py", line 1614, in add_filter
    connector)
  File "/usr/lib/python2.5/site-packages/django/db/models/sql/where.py", line 56, in add
    obj, params = obj.process(lookup_type, value)
  File "/usr/lib/python2.5/site-packages/django/db/models/sql/where.py", line 269, in process
    params = self.field.get_db_prep_lookup(lookup_type, value)
  File "/usr/lib/python2.5/site-packages/django/db/models/fields/__init__.py", line 214, in get_db_prep_lookup
    return ["%%%s%%" % connection.ops.prep_for_like_query(value)]
  File "/usr/lib/python2.5/site-packages/django/db/backends/__init__.py", line 364, in prep_for_like_query
    return smart_unicode(x).replace("\\", "\\\\").replace("%", "\%").replace("_", "\_")
  File "/usr/lib/python2.5/site-packages/django/utils/encoding.py", line 44, in smart_unicode
    return force_unicode(s, encoding, strings_only, errors)
  File "/usr/lib/python2.5/site-packages/django/utils/encoding.py", line 92, in force_unicode
    raise DjangoUnicodeDecodeError(s, *e.args)
DjangoUnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: unexpected end of data. You passed in 'f\xfcr' (<type 'str'>)

This is because Django assumes the bytestring is utf-8 encoded and fails
when it tries to decode it to unicode as utf-8, since the bytes are not
valid utf-8 data.
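
The same failure can be reproduced without Django at all. As a minimal sketch
in Python 3 (not the code path Django itself runs, and the exact wording of the
error differs between Python versions), decoding the iso-8859-1 bytes as utf-8
raises, while decoding them with the encoding that produced them succeeds:

```python
# Python 3 sketch: decode iso-8859-1 bytes as if they were utf-8.
latin1_bytes = "f\u00fcr".encode("iso-8859-1")  # b'f\xfcr'

try:
    latin1_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError as exc:
    # The lone byte FC is not a valid utf-8 sequence.
    decoded_ok = False
    print("utf-8 decode failed:", exc.reason)

# Decoding with the encoding that actually produced the bytes works.
text = latin1_bytes.decode("iso-8859-1")
print(ascii(text))  # 'f\xfcr'
```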

The only way I have been able to recreate anything like what you are
describing is to incorrectly construct the original unicode object by
decoding a utf-8 bytestring as if it were iso-8859-1 encoded:

>>> badufur = ufur.encode('utf-8').decode('iso-8859-1')
>>> badufur
u'f\xc3\xbcr'
>>> print badufur
fÃ¼r
>>> print badufur.encode('utf-8')
fÃ¼r
>>> print badufur.encode('iso-8859-1')
für

Using that unicode object doesn't produce any hits in the DB:

>>> Thing.objects.filter(name__icontains=badufur)
[]

But encoding it to iso-8859-1 does, because that has the effect of restoring
the original utf-8 bytestring:

>>> Thing.objects.filter(name__icontains=badufur.encode('iso-8859-1'))
[<Thing: für inserted as unicode>, <Thing: für inserted as utf8 bytestring>]
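
The round trip behind that result can be sketched on its own, without a
database. This is a Python 3 illustration under the same assumptions as the
session above; the names are only for the example:

```python
# Python 3 sketch of the mis-decoding described above.
ufur = "f\u00fcr"

# utf-8 bytes wrongly read as iso-8859-1: each byte becomes one character,
# so the two-byte umlaut turns into two separate characters.
badufur = ufur.encode("utf-8").decode("iso-8859-1")
print(ascii(badufur))            # 'f\xc3\xbcr'
print(len(ufur), len(badufur))   # 3 4

# Encoding back to iso-8859-1 restores the original utf-8 bytes...
restored_bytes = badufur.encode("iso-8859-1")
print(restored_bytes)            # b'f\xc3\xbcr'

# ...which decode correctly as utf-8.
print(restored_bytes.decode("utf-8") == ufur)  # True
```

This works because iso-8859-1 maps every byte value 0-255 to the code point
with the same number, so the decode/encode pair loses nothing.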

However, the debug info you show above doesn't show an incorrectly-built
unicode object, so I'm very confused by it.

Karen

--

You received this message because you are subscribed to the Google Groups 
"Django users" group.
