Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

Ivan Sagalaev Tue, 30 Jan 2007 01:30:13 -0800

Michael Radziej wrote:
> I thank you for all your patience with me. I was completely off-track. I
> read all the mails again, and everything is starting to make sense now.


Then I hope not to confuse you (and everyone else) with my answer :-)

> First, contrary to my former opinion, #3370 is a bug in the newforms
> module, as it is passing unicode to the database API which is not ripe
> for it and will break as soon as you leave ASCII.

I wouldn't call it a bug. Newforms are intended to work in unicode. They
don't play nicely with db backends right now, but the question is what
should change: newforms to supply byte strings, or db backends to accept
unicode.

> I see three ways to fix the problem in #3370:
> 
> a) newforms stops passing unicode strings to the Database API and uses
> bytestrings.
> 
> b) the database wrapper in Django sets connection.charset (but needs to
> translate the charset name since the databases don't understand all
> charset name variants, see ticket #952 here). This is the approach of
> the patches in tickets #1356 and #3370.
> 
> c) the database wrapper in Django must check whether it gets unicode. In
> this case, it needs to encode it into a bytestring.

I believe options a) and b) together will do the job.

Right now we have all these confusing bugs because db backends receive
two kinds of input: unicode from newforms and byte strings from oldforms
(the majority of existing code, I think). Newforms are "guilty" of
introducing unicode to the party, so I think it's better to keep all the
conversions there.

Option b) is needed because a db backend has to know which encoding the
byte strings it receives are in. The great advantage of unicode is that
you don't have to supply a text's encoding alongside it: the characters
are unambiguous by themselves. With byte strings that extra information
is necessary.

Option c) scares me :-). The need to work with byte strings (and hence
options a) and b)) remains, and on top of that the backends would be able
to accept unicode objects but not return them. I don't think people
would thank us for this :-)
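
For reference, option c) would amount to something like the following
in the wrapper. This is only a sketch in Python 3 syntax (`str` standing
in for unicode objects, `bytes` for byte strings); `to_bytestring` is a
hypothetical name, not an existing Django function.

```python
def to_bytestring(value, charset):
    """Hypothetical helper: encode unicode input, pass byte strings through."""
    if isinstance(value, str):
        # Unicode came in (e.g. from newforms): encode it for the backend.
        return value.encode(charset)
    # Already a byte string (e.g. from oldforms): assume it is in `charset`.
    return value


print(to_bytestring("\u20ac", "utf-8"))   # unicode in, bytes out
print(to_bytestring(b"abc", "utf-8"))     # bytes pass through untouched
```

Note the asymmetry: nothing here decodes what comes back from the
database, so unicode can go in but only byte strings come out.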

> With all three variants, what encoding should be used? We currently
> issue (without #952) a 'SET NAMES utf8' at the beginning of each
> connection, so the database server expects to receive utf8. So,
> shouldn't we currently always use utf8 encoding, regardless of what is
> in settings.DEFAULT_CHARSET?

No, we shouldn't. In fact this never worked properly; #952 is an old
bug. It kinda works most of the time because the default value of
DEFAULT_CHARSET is 'utf-8' and most apps don't change it. But if they do
change it and actually work with non-utf-8 data, that data will break
when fed into a database declared as utf-8, because text in an arbitrary
single-byte encoding is generally not well-formed utf-8.
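
A quick way to see this with the Euro sign (Python 3 syntax; the helper
name `is_well_formed_utf8` is made up for the illustration):

```python
def is_well_formed_utf8(data):
    """Return True if the byte string decodes cleanly as utf-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


euro_cp1252 = "\u20ac".encode("cp1252")   # a single byte in windows-1252
euro_utf8 = "\u20ac".encode("utf-8")      # three bytes in utf-8

print(is_well_formed_utf8(euro_cp1252))   # the single-byte form is rejected
print(is_well_formed_utf8(euro_utf8))     # the utf-8 form is accepted
```

Plain ASCII text passes either way, which is why the problem only shows
up once non-ASCII characters appear.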

Databases react differently: Postgres complains that it's not utf-8 and
refuses to accept garbage (I love Postgres :-) ). MySQL, at least in some
versions, just doesn't check the encoding and stores the data as a byte
array. Sorting and case insensitivity won't work, but at least you can
SELECT everything back unchanged, which supports the notion that it
"works" :-).
Actually this means that #3370 is safe to include because it's 
MySQL-only, doesn't affect byte strings at all because of MySQL's 
liberal interface and actually fixes a bug when it receives unicode from 
newforms. I'm against it only because it creates this incomprehensible 
mess of conventions and edge cases neutralizing each other... #952 is 
just a more general way of doing things.
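
To illustrate the charset name translation that the more general
approach needs, here is a rough sketch. The table and the function name
`mysql_charset_name` are hypothetical and not taken from the patch in
#952; the table covers only a few names for the sake of the example.

```python
# Illustrative mapping from DEFAULT_CHARSET-style names to MySQL charset
# names. A real table would need many more entries.
MYSQL_CHARSETS = {
    "utf-8": "utf8",
    "utf8": "utf8",
    "iso-8859-1": "latin1",
    "latin-1": "latin1",
    "windows-1251": "cp1251",
    "cp1251": "cp1251",
    "koi8-r": "koi8r",
}


def mysql_charset_name(default_charset):
    """Translate a charset name into the spelling MySQL expects."""
    key = default_charset.strip().lower().replace("_", "-")
    try:
        return MYSQL_CHARSETS[key]
    except KeyError:
        # Fail loudly rather than guess at an unknown charset.
        raise ValueError("no MySQL charset known for %r" % default_charset)


print(mysql_charset_name("UTF-8"))
print(mysql_charset_name("windows-1251"))
```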

> Well, the current patch in #3370 (I still ignore __repr__) only changes
> the charset attribute of a connection, and this attribute is used only
> to encode unicode strings when sending data to the database, or to
> decode bytestrings received from the database when MySQLdb is configured
> to produce unicode ('use_unicode').

BTW I'm -1 on switching backends to unicode right now because:

1. We would have to decode/encode manually for backends that can't do it
themselves (say, psycopg1)

2. We would immediately get __str__'s returning unicode objects, which
will open a can of worms of confusion (and flame wars :-) ).

> I don't see a problem with the generic views since they pass bytestrings
> to the database wrapper, this gets as bytestrings to MySQLdb, and for
> bytestrings the charset attribute is not used at all.

Umm... This is exactly the problem with byte strings: they require
knowledge of a charset somewhere.
