Michael Radziej wrote:
> I thank you for all your patience with me. I was completely off-track. I
> read all the mails again, and everything is starting to make sense now.
Then I hope not to confuse you (and everyone else) with my answer :-)
> First, contrary to my former opinion, #3370 is a bug in the newforms
> module, as it is passing unicode to the database API which is not ripe
> for it and will break as soon as you leave ASCII.
I wouldn't call it a bug. Newforms is intended to work in unicode. It
doesn't play nicely with the db backends right now, but the question is what
should change: newforms to supply byte strings, or db backends to accept unicode.
> I see three ways to fix the problem in #3370:
>
> a) newforms stops passing unicode strings to the Database API and uses
> bytestrings.
>
> b) the database wrapper in Django sets connection.charset (but needs to
> translate the charset name since the databases don't understand all
> charset name variants, see ticket #952 here). This is the approach of
> the patches in tickets #1356 and #3370.
>
> c) the database wrapper in Django must check whether it gets unicode. In
> this case, it needs to encode it into a bytestring.
I believe options a) and b) together will do the job.
Right now we have all these confusing bugs because db backends receive two
kinds of input: unicode from newforms and byte strings from oldforms (the
majority of existing code, I think). Newforms is "guilty" of introducing
unicode to the party, so I think it's better to keep all the conversions
there.
Option b) is needed because a db backend has to know which encoding the
byte strings it receives are in. The great advantage of unicode is that
you don't have to supply a text's encoding alongside it: the characters
are unambiguous by themselves. But with byte strings it's necessary.
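To illustrate the charset name translation that option b) needs, here is
a rough sketch in present-day Python. The mapping table and the function
name are hypothetical and the table is deliberately incomplete. It only
shows the idea of normalizing whatever is in DEFAULT_CHARSET before
handing a name to MySQL:

```python
import codecs

# Hypothetical, incomplete mapping from Python's canonical codec names
# to the charset names MySQL understands.
PYTHON_TO_MYSQL_CHARSET = {
    "utf-8": "utf8",
    "iso8859-1": "latin1",
    "cp1251": "cp1251",
    "koi8-r": "koi8r",
}


def mysql_charset_name(default_charset):
    """Translate a charset name such as 'UTF-8' into a MySQL charset name."""
    # codecs.lookup() resolves aliases ('windows-1251' -> 'cp1251', ...).
    canonical = codecs.lookup(default_charset).name
    try:
        return PYTHON_TO_MYSQL_CHARSET[canonical]
    except KeyError:
        raise ValueError("no known MySQL charset for %r" % default_charset)
```

Unknown names fail loudly here on purpose. Silently falling back to utf8
would bring back exactly the kind of mismatch discussed in this thread.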
Option c) scares me :-). The need to work with byte strings (and hence
the need for options a) and b)) remains, but on top of that the backends
gain the ability to accept unicode objects without ever returning them.
I don't think people would thank us for this :-)
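For reference, this is roughly what I understand option c) to mean,
written in present-day Python where `str` is the unicode type and
`bytes` is the byte string. The function name is made up for
illustration and this is not code from any of the patches:

```python
def encode_params(params, charset):
    """Encode text parameters to bytes; pass everything else through."""
    return tuple(
        p.encode(charset) if isinstance(p, str) else p
        for p in params
    )
```

Note the asymmetry: text goes in, but whatever comes back from the
database is still a byte string that the caller has to decode.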
> With all three variants, what encoding should be used? We currently
> issue (without #952) a 'set names utf8' at the beginning of each
> connection, so the database server expects to receive utf8. So,
> shouldn't we currently always use utf8 encoding, regardless of what is
> in settings.DEFAULT_CHARSET?
No, we shouldn't. In fact this never worked properly; #952 is an old
bug. It kinda works most of the time because the default value of
DEFAULT_CHARSET is 'utf-8' and most apps don't change it. But if an app
does change it and actually works with non-utf-8 data, things break when
that data is fed into a connection declared as utf-8, because text in an
arbitrary single-byte encoding is generally not well-formed utf-8.
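A quick way to see this for yourself, sketched in present-day Python
(the helper name is mine, and cp1251 is just an example of a single-byte
encoding):

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as utf-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


text = "привет"
legacy_bytes = text.encode("cp1251")  # what a cp1251 app would send
utf8_bytes = text.encode("utf-8")     # what the connection was told to expect
```

Here `is_valid_utf8(utf8_bytes)` holds while `is_valid_utf8(legacy_bytes)`
does not, which is the check a strict database effectively performs.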
Databases react differently: Postgres complains that it's not utf-8 and
refuses to accept garbage (I love Postgres :-) ). MySQL, at least in some
versions, doesn't check the encoding and just stores the data as a byte
array. Sorting and case insensitivity won't work, but at least you can
SELECT everything back unchanged, which supports the notion that it
"works" :-).
Actually this means that #3370 is safe to include because it's
MySQL-only, doesn't affect byte strings at all because of MySQL's
liberal interface and actually fixes a bug when it receives unicode from
newforms. I'm against it only because it creates this incomprehensible
mess of conventions and edge cases neutralizing each other... #952 is
just a more general way of doing things.
> Well, the current patch in #3370 (I still ignore __repr__) only changes
> the charset attribute of a connection, and this attribute is used only
> to encode unicode strings when sending data to the database, or to
> decode bytestrings received from the database when MySQLdb is configured
> to produce unicode ('use_unicode').
BTW I'm -1 on switching backends to unicode right now because:
1. We would have to decode/encode manually for backends that can't do it
themselves (say, psycopg1)
2. We would immediately get __str__'s returning unicode objects, which
opens a can of worms of confusion (and flame wars :-) ).
> I don't see a problem with the generic views since they pass bytestrings
> to the database wrapper, this gets as bytestrings to MySQLdb, and for
> bytestrings the charset attribute is not used at all.
Umm... This is the exact problem with byte strings: that they require
knowledge of a charset somewhere.
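A small illustration in present-day Python (the two encodings are picked
arbitrarily): the same bytes give different text depending on which
charset you assume, and nothing in the bytes themselves tells you which
one is right.

```python
raw = "привет".encode("cp1251")  # bytes as a cp1251 app would produce them

as_cp1251 = raw.decode("cp1251")  # the text that was meant
as_koi8r = raw.decode("koi8-r")   # decodes without error, but to other text
```

Both decodes succeed, so a wrong guess is silent: only `as_cp1251`
equals the original string.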