On Thu, Nov 26, 2009 at 7:03 AM, Hinnack <henrik.gens...@googlemail.com> wrote:

> Hi Karen,
>
> thanks again for your reply.
> I use Aptana with pydev extension.
> Debugging the app shows the following for search:
> dict: {u'caption': u'f\\xfcr', u'showold': False}

That's confusing to me, because other than having an extra \ (which could be an artifact of how it's being displayed), that looks like a correctly-built unicode object for für.

> and for qs:
> str: für
> although it seems to be � instead of ASCII 252 - but this could be,
> because I am sitting on a MAC
> while debugging.

Using python manage.py shell might shed more light; I fear the tool here is assuming an incorrect bytestring encoding and getting in the way.

I cannot recreate anything like what you are seeing. I have a model Thing stored in a MySQL DB (using a utf-8 encoded table) with CharField name. There are two instances of this Thing in the DB that contain für in the name. From a python manage.py shell, using Django 1.1.1:

>>> from ttt.models import Thing
>>> import django
>>> django.get_version()
'1.1.1'
>>> ufur = u'f\u00fcr'
>>> print ufur
für
>>> ufur
u'f\xfcr'
>>> ufur.encode('utf-8')
'f\xc3\xbcr'
>>> ufur.encode('iso-8859-1')
'f\xfcr'

Small-u with umlaut is U+00FC. Encoded in utf-8 it takes the 2 bytes C3 BC; encoded in iso-8859-1 it is the 1 byte FC.

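For anyone who wants to check the byte counts without a Django project, here is a minimal sketch in Python 3 syntax (the session above is Python 2, where the same values are written as u'...' and '...' literals). The variable names utf8_bytes and latin1_bytes are only illustrative.

```python
# Python 3 sketch: str is unicode text, bytes is an encoded bytestring.
ufur = "f\u00fcr"  # "für", with U+00FC (small u with umlaut)

utf8_bytes = ufur.encode("utf-8")         # U+00FC becomes the two bytes C3 BC
latin1_bytes = ufur.encode("iso-8859-1")  # U+00FC becomes the single byte FC

print(utf8_bytes)    # b'f\xc3\xbcr'
print(latin1_bytes)  # b'f\xfcr'
print(len(utf8_bytes), len(latin1_bytes))  # 4 3
```

The same three characters therefore occupy four bytes in utf-8 and three in iso-8859-1, which is why a bytestring cannot be interpreted safely without knowing its encoding.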
Filtering with icontains, using either the unicode object or the utf-8 encoded bytestring version, works properly:

>>> Thing.objects.filter(name__icontains=ufur)
[<Thing: für inserted as unicode>, <Thing: für inserted as utf8 bytestring>]
>>> Thing.objects.filter(name__icontains=ufur.encode('utf-8'))
[<Thing: für inserted as unicode>, <Thing: für inserted as utf8 bytestring>]

Attempting to filter with an iso-8859-1 encoded bytestring raises an error:

>>> Thing.objects.filter(name__icontains=ufur.encode('iso-8859-1'))
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/usr/lib/python2.5/site-packages/django/db/models/manager.py", line 129, in filter
    return self.get_query_set().filter(*args, **kwargs)
  File "/usr/lib/python2.5/site-packages/django/db/models/query.py", line 498, in filter
    return self._filter_or_exclude(False, *args, **kwargs)
  File "/usr/lib/python2.5/site-packages/django/db/models/query.py", line 516, in _filter_or_exclude
    clone.query.add_q(Q(*args, **kwargs))
  File "/usr/lib/python2.5/site-packages/django/db/models/sql/query.py", line 1675, in add_q
    can_reuse=used_aliases)
  File "/usr/lib/python2.5/site-packages/django/db/models/sql/query.py", line 1614, in add_filter
    connector)
  File "/usr/lib/python2.5/site-packages/django/db/models/sql/where.py", line 56, in add
    obj, params = obj.process(lookup_type, value)
  File "/usr/lib/python2.5/site-packages/django/db/models/sql/where.py", line 269, in process
    params = self.field.get_db_prep_lookup(lookup_type, value)
  File "/usr/lib/python2.5/site-packages/django/db/models/fields/__init__.py", line 214, in get_db_prep_lookup
    return ["%%%s%%" % connection.ops.prep_for_like_query(value)]
  File "/usr/lib/python2.5/site-packages/django/db/backends/__init__.py", line 364, in prep_for_like_query
    return smart_unicode(x).replace("\\", "\\\\").replace("%", "\%").replace("_", "\_")
  File "/usr/lib/python2.5/site-packages/django/utils/encoding.py", line 44, in smart_unicode
    return force_unicode(s, encoding, strings_only, errors)
  File "/usr/lib/python2.5/site-packages/django/utils/encoding.py", line 92, in force_unicode
    raise DjangoUnicodeDecodeError(s, *e.args)
DjangoUnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: unexpected end of data. You passed in 'f\xfcr' (<type 'str'>)

This is because Django assumes the bytestring is utf-8 encoded. The conversion to unicode, which specifies utf-8 as the string's encoding, fails because the bytes are not valid utf-8 data.

The only way I have been able to recreate anything like what you are describing is to incorrectly construct the original unicode object from a utf-8 bytestring, assuming an iso-8859-1 encoding:

>>> badufur = ufur.encode('utf-8').decode('iso-8859-1')
>>> badufur
u'f\xc3\xbcr'
>>> print badufur
fÃ¼r
>>> print badufur.encode('utf-8')
fÃ¼r
>>> print badufur.encode('iso-8859-1')
für

Using that unicode object doesn't produce any hits in the DB:

>>> Thing.objects.filter(name__icontains=badufur)
[]

But encoding it to iso-8859-1 does, because that has the effect of restoring the original utf-8 bytestring:

>>> Thing.objects.filter(name__icontains=badufur.encode('iso-8859-1'))
[<Thing: für inserted as unicode>, <Thing: für inserted as utf8 bytestring>]

However, the debug info you show above doesn't show an incorrectly-built unicode object, so I'm very confused by it.

Karen
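
The two failure modes described in this message can be reproduced without a database. The following is a minimal sketch in Python 3 syntax (the sessions above are Python 2); decode_failed and restored are illustrative names, and the exact wording of the decode error differs between Python versions.

```python
ufur = "f\u00fcr"  # "für"

# Case 1: iso-8859-1 bytes handed to a utf-8 decoder.
# The byte FC followed by "r" is not a valid utf-8 sequence, so decoding fails.
try:
    ufur.encode("iso-8859-1").decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Case 2: utf-8 bytes mis-read as iso-8859-1.
# This never fails, because every byte is a valid iso-8859-1 character,
# but it yields a four-character string instead of the original three.
badufur = ufur.encode("utf-8").decode("iso-8859-1")
print(ascii(badufur))  # 'f\xc3\xbcr'

# Re-encoding as iso-8859-1 restores the original utf-8 bytes,
# which then decode back to the intended text.
restored = badufur.encode("iso-8859-1").decode("utf-8")
print(restored == ufur)  # True
```

Case 2 is the harder one to spot in practice, since nothing raises: the string is simply wrong, and a lookup with it only matches once the mistaken decode is undone.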