unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-03-23 Thread dummy
Hi,

I came across those tickets and the corresponding thread just yesterday, and as 
far as I understand, the main problem is that newforms talks unicode 
internally and at its interface to the Django ORM.

I attached my solution to this problem for django.newforms.models (diffed 
against latest SVN), which encodes values to settings.DEFAULT_CHARSET on 
save(), at the boundary between newforms and the Django ORM.

This patch wouldn't be needed or could be removed if django-ORM and/or 
db-backends all talk unicode/utf-8.

Regards,
Dirk

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To post to this group, send email to django-developers@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en
-~--~~~~--~~--~--~---

Index: models.py
===
--- models.py	(revision 4775)
+++ models.py	(working copy)
@@ -8,6 +8,7 @@
 from forms import BaseForm, DeclarativeFieldsMetaclass, SortedDictFromList
 from fields import Field, ChoiceField
 from widgets import Select, SelectMultiple, MultipleHiddenInput
+from django.conf import settings
 
 __all__ = ('save_instance', 'form_for_model', 'form_for_instance', 'form_for_fields',
'ModelChoiceField', 'ModelMultipleChoiceField')
@@ -38,7 +39,10 @@
     for f in opts.fields:
         if not f.editable or isinstance(f, models.AutoField):
             continue
-        setattr(instance, f.name, clean_data[f.name])
+        try:
+          setattr(instance, f.name, clean_data[f.name].encode(settings.DEFAULT_CHARSET))
+        except:
+          setattr(instance, f.name, clean_data[f.name])
     if commit:
         instance.save()
         for f in opts.many_to_many:
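
The bare `except:` in the patch above catches every exception, so a genuine encoding failure would silently fall back to storing the unencoded value. A narrower variant could look like the sketch below. It is written in Python 3 terms (where the thread's `unicode` is `str` and a bytestring is `bytes`), and `DEFAULT_CHARSET`, `encode_for_model`, `Instance` and `apply_clean_data` are hypothetical stand-ins, not Django names.

```python
DEFAULT_CHARSET = "utf-8"  # stand-in for settings.DEFAULT_CHARSET (assumption)


def encode_for_model(value, charset=DEFAULT_CHARSET):
    """Encode text to bytes; pass every other type through untouched."""
    if isinstance(value, str):
        # Encoding errors propagate instead of being swallowed.
        return value.encode(charset)
    return value


class Instance:
    """Hypothetical stand-in for a model instance."""


def apply_clean_data(instance, clean_data, charset=DEFAULT_CHARSET):
    """Copy cleaned form values onto the instance, encoding text on the way."""
    for name, value in clean_data.items():
        setattr(instance, name, encode_for_model(value, charset))
    return instance
```

Checking the type up front keeps integers, dates and `None` out of the encoding path, which is what the patch's fallback branch appears to be for.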


Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-02-15 Thread Ivan Sagalaev

Adrian Holovaty wrote:
> Hi Ivan,
> 
> Could you explain again why you think newforms should output
> clean_data as bytestrings rather than Unicode strings?

I don't think so :-). Quite the opposite. I think it's good that you 
made newforms in unicode since it effectively gives a start to 
unicodification and lets us test it early.

But there is one issue that is very annoying: when newform saves data to 
a model instance it fills it with unicode. This creates some issues:

- part of the time an object stores bytes (when loaded from db), part of 
the time -- unicode (when updated from newforms)
- __str__s return unicode instead of str
- some backend versions can't handle unicode and implicitly str() it as 
ASCII, which raises UnicodeEncodeError
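
The last point can be reproduced in a few lines. This is only an illustration in Python 3 terms, not code from any backend; `implicit_ascii` is a hypothetical helper that mimics a driver encoding text as ASCII behind the caller's back.

```python
text = "5 €"  # text containing a non-ASCII character


def implicit_ascii(value):
    """Mimic a backend that implicitly encodes text as ASCII."""
    return value.encode("ascii")


try:
    implicit_ascii(text)
    failed = False
except UnicodeEncodeError:
    # The Euro sign has no ASCII representation.
    failed = True

# Encoding explicitly with a charset that covers the character works.
explicit = text.encode("utf-8")
```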

The patch that I'm advocating for here does one little thing: when a 
newform saves an instance it converts unicode into DEFAULT_CHARSET. In 
other words it makes newforms do decode/encode on both their boundaries: 
not only on the side facing the web (POSTs, templates) but also on db side.

> Are you suggesting that we would convert newforms
> clean_data *back* to being Unicode *after* we convert the rest of the
> framework to be Unicode-aware?

No, clean_data will remain in unicode (and for good). Encoding happens 
only when clean_data is actually applied to a model instance.

> I apologize in advance if you've already brought this up and explained
> it. Just trying to understand your thinking here.

Yeah, I understand that this thread is rather long :-)




Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-02-15 Thread Michael Radziej

Adrian Holovaty:
> On 2/15/07, Ivan Sagalaev <[EMAIL PROTECTED]> wrote:
>> I tried to show that it leaves out only two things:
>>
>> - if DEFAULT_CHARSET is different than DB charset it won't work (but
>> it's a weird situation, most legacy systems have one legacy encoding for
>> both)
>>
>> - it doesn't help if unicode is actually put into models or in raw SQL
>> manually but this bug was never about it anyway and won't break anything
>> since it fixes newforms, not backends
> 
> Hi Ivan,
> 
> Could you explain again why you think newforms should output
> clean_data as bytestrings rather than Unicode strings?

The current situation is this:

* newforms puts unicode into objects that used to receive only
  UTF-8 encoded bytestrings

* the models (and other parts) only work with bytestrings
  (or as long as the unicode contains only characters from ASCII)

* it is hard to convert all the rest of Django to be able to deal
  with unicode and bytestring at the same time, and it seems that
  this has been postponed until after 1.0.

Ivan proposes a fix that tries to convert unicode to bytestrings at
the boundary of newforms by encoding unicode to bytestrings in
clean_data. (I have not checked whether this resolves all or at
least a big part of the problem, and I don't have a position about
this, yet.)

It looks like a step backwards, but as long as we don't try to make
everything unicode compatible at the same time, we need to encode
the unicode strings at some boundary or other. Is clean_data the
right point for this?


Michael

-- 
noris network AG - Deutschherrnstraße 15-19 - D-90429 Nürnberg -
Tel +49-911-9352-0 - Fax +49-911-9352-100

http://www.noris.de - The IT-Outsourcing Company




Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-02-15 Thread Adrian Holovaty

On 2/15/07, Ivan Sagalaev <[EMAIL PROTECTED]> wrote:
> I tried to show that it leaves out only two things:
>
> - if DEFAULT_CHARSET is different than DB charset it won't work (but
> it's a weird situation, most legacy systems have one legacy encoding for
> both)
>
> - it doesn't help if unicode is actually put into models or in raw SQL
> manually but this bug was never about it anyway and won't break anything
> since it fixes newforms, not backends

Hi Ivan,

Could you explain again why you think newforms should output
clean_data as bytestrings rather than Unicode strings?

If I understand your argument correctly, you're saying newforms should
be rolled back to bytestrings because the rest of the framework isn't
Unicode-aware yet. Are you suggesting that we would convert newforms
clean_data *back* to being Unicode *after* we convert the rest of the
framework to be Unicode-aware?

I apologize in advance if you've already brought this up and explained
it. Just trying to understand your thinking here.

Adrian

-- 
Adrian Holovaty
holovaty.com | djangoproject.com




Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-02-15 Thread Ivan Sagalaev

Michael Radziej wrote:
> (meeting postponed ...)

Nice :-)

> You're right, sorry. I was in a different ticket and somehow thought
> it was the same.
> 
> Yes, #3370 looks interesting and is a different solution. I'm not
> sure whether it deals with all the issues of this thread.

I tried to show that it leaves out only two things:

- if DEFAULT_CHARSET is different than DB charset it won't work (but 
it's a weird situation, most legacy systems have one legacy encoding for 
both)

- it doesn't help if unicode is actually put into models or in raw SQL 
manually but this bug was never about it anyway and won't break anything 
since it fixes newforms, not backends
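
The first limitation can be illustrated with a small round trip. This is a sketch under stated assumptions: `store_and_load` is a hypothetical helper, and the two charset arguments stand in for DEFAULT_CHARSET and the charset the database assumes.

```python
def store_and_load(text, web_charset, db_charset):
    """Encode with the web charset, then interpret the bytes with the DB charset."""
    raw = text.encode(web_charset)   # what newforms would hand over after encoding
    return raw.decode(db_charset)    # how the database side would read those bytes


same = store_and_load("Привет", "windows-1251", "windows-1251")
mismatch = store_and_load("Привет", "windows-1251", "latin-1")
```

With matching charsets the text survives; with a mismatch the bytes are silently reinterpreted as different characters, without any error being raised.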




Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-02-15 Thread Michael Radziej

Ivan Sagalaev:
> Michael Radziej wrote:
>> Ivan Sagalaev:
>>
>>> Michael, the ticket http://code.djangoproject.com/ticket/3370 just got a 
>>> patch that does a) and it's really small. It's not as full as having b) 
>>> and d) but I think they are really corner cases: b) for different 
>>> encodings in DB and in web, d) for handling unicode input to DB backend 
>>> *without* newforms.
>>>
>>> In other words I think that patch is just right for current situation 
>>> because it fixes the bug for people trying to use newforms now. I'm +1 
>>> on just committing it as is.
>> I'm not sure if the fix is on the right level. StrAndUnicode is used
>> in a lot of places. Is it sure that it won't put xmlcharref-encoded
>> data into the database? I only had a very quick look on it (and I
>> need to go to a meeting now).
> 
> Uhm... Are we talking about the same patch? This is it: 
> http://code.djangoproject.com/attachment/ticket/3370/models.py.diff
> 
> It doesn't mention StrAndUnicode at all.

(meeting postponed ...)

You're right, sorry. I was in a different ticket and somehow thought
it was the same.

Yes, #3370 looks interesting and is a different solution. I'm not
sure whether it deals with all the issues of this thread.

Michael


-- 
noris network AG - Deutschherrnstraße 15-19 - D-90429 Nürnberg -
Tel +49-911-9352-0 - Fax +49-911-9352-100

http://www.noris.de - The IT-Outsourcing Company




Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-02-15 Thread Ivan Sagalaev

Michael Radziej wrote:
> Ivan Sagalaev:
> 
>> Michael, the ticket http://code.djangoproject.com/ticket/3370 just got a 
>> patch that does a) and it's really small. It's not as full as having b) 
>> and d) but I think they are really corner cases: b) for different 
>> encodings in DB and in web, d) for handling unicode input to DB backend 
>> *without* newforms.
>>
>> In other words I think that patch is just right for current situation 
>> because it fixes the bug for people trying to use newforms now. I'm +1 
>> on just committing it as is.
> 
> I'm not sure if the fix is on the right level. StrAndUnicode is used
> in a lot of places. Is it sure that it won't put xmlcharref-encoded
> data into the database? I only had a very quick look on it (and I
> need to go to a meeting now).

Uhm... Are we talking about the same patch? This is it: 
http://code.djangoproject.com/attachment/ticket/3370/models.py.diff

It doesn't mention StrAndUnicode at all.




Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-02-15 Thread Ivan Sagalaev

Michael Radziej wrote:
> A few days ago, I wrote:
>> I see three ways to fix the problem in #3370:
>>
>> a) newforms stops passing unicode strings to the Database API and uses
>> bytestrings.
>>
>> b) the database wrapper in Django sets connection.charset (but needs to
>> translate the charset name since the databases don't understand all
>> charset name variants, see ticket #952 here). This is the approach of
>> the patches in tickets #1356 and #3370.
>>
>> c) the database wrapper in Django must check whether it gets unicode. In
>> this case, it needs to encode it into a bytestring.
> 
> I now see a fourth way that would resolve #952 at the same time:
> 
> d) make the database wrapper accept both unicode and bytestrings in
> the models, but always pass unicode strings to the database backend.

Michael, the ticket http://code.djangoproject.com/ticket/3370 just got a 
patch that does a) and it's really small. It's not as full as having b) 
and d) but I think they are really corner cases: b) for different 
encodings in DB and in web, d) for handling unicode input to DB backend 
*without* newforms.

In other words I think that patch is just right for current situation 
because it fixes the bug for people trying to use newforms now. I'm +1 
on just committing it as is.




Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-02-04 Thread Bjørn Stabell

On Feb 1, 4:16 pm, Michael Radziej <[EMAIL PROTECTED]> wrote:
> Ivan Sagalaev:
>
> > Michael Radziej wrote:
> >> d) make the database wrapper accept both unicode and bytestrings in
> >> the models, but always pass unicode strings to the database backend.

Sounds like a reasonable proposal.  You may even consider logging
deprecation messages in the case of bytestrings appearing in models
(but be careful not to create a flood of these).
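
One way this could look, as a sketch only: the standard `warnings` module lets the caller control repetition through its filters (for example the "once" or "default" actions), which helps with the flooding concern. `to_text` is a hypothetical helper, written in Python 3 terms where a bytestring is `bytes`.

```python
import warnings


def to_text(value, charset="utf-8"):
    """Decode bytestrings to text, warning that bytestrings are deprecated."""
    if isinstance(value, bytes):
        warnings.warn(
            "bytestrings in model fields are deprecated; pass text instead",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
        return value.decode(charset)
    return value
```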





Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-02-02 Thread Julian 'Julik' Tarkhanov


On Jan 27, 2007, at 6:44 PM, ak wrote:

> And another thing I still don't understand is: let's pretend I use
> MySQL 4.0 with national charset and my templates are in the same
> charset too. How would work:

It should not work.
-- 
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl






Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-02-02 Thread Julian 'Julik' Tarkhanov


On Jan 27, 2007, at 6:44 PM, ak wrote:

> 1. newforms are with unicode inside
> 2. ORM is with str inside
3. welcome to the world of pain
-- 
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl






Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-01-31 Thread Michael Radziej

Hi,

A few days ago, I wrote:
> I see three ways to fix the problem in #3370:
> 
> a) newforms stops passing unicode strings to the Database API and uses
> bytestrings.
> 
> b) the database wrapper in Django sets connection.charset (but needs to
> translate the charset name since the databases don't understand all
> charset name variants, see ticket #952 here). This is the approach of
> the patches in tickets #1356 and #3370.
> 
> c) the database wrapper in Django must check whether it gets unicode. In
> this case, it needs to encode it into a bytestring.

I now see a fourth way that would resolve #952 at the same time:

d) make the database wrapper accept both unicode and bytestrings in
the models, but always pass unicode strings to the database backend.

Details:

For #952 to work, the name of the character encoding has to be
translated from Python naming conventions to those of the backend in
use, and this would need a huge table (see the ticket). It looks
easy, but it's a major annoyance.
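
To give an idea of what such a table involves, here is a deliberately tiny sketch for MySQL only. `PYTHON_TO_MYSQL` and `mysql_charset` are hypothetical names, the three entries are merely examples, and a real table would need many more rows per backend.

```python
import codecs

# Hypothetical, incomplete mapping from Python's canonical codec names
# to MySQL character set names.
PYTHON_TO_MYSQL = {
    "utf-8": "utf8",
    "iso8859-1": "latin1",
    "cp1251": "cp1251",
}


def mysql_charset(python_name):
    """Translate a Python encoding name (any alias) to a MySQL charset name."""
    # codecs.lookup() normalizes aliases such as "UTF8" or "latin-1".
    canonical = codecs.lookup(python_name).name
    try:
        return PYTHON_TO_MYSQL[canonical]
    except KeyError:
        raise ValueError("no MySQL charset known for %r" % python_name) from None
```

Normalizing through `codecs.lookup()` first keeps the table keyed on one spelling per encoding, but every backend would still need its own value column.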

Now, instead of doing this, how about modifying the database wrapper
so that it actually tests whether it gets unicode or bytestrings,
and in the case of bytestrings, decodes them to unicode using
settings.DEFAULT_CHARSET as the encoding? Then it could use unicode to
talk to its backend. As far as I can see, psycopg2 is unicode capable,
and python-MySQLdb, too.
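
A minimal sketch of that check, in Python 3 terms (`str` for unicode, `bytes` for bytestrings). `DEFAULT_CHARSET`, `to_unicode` and `prepare_params` are hypothetical names for illustration and are not part of Django's database wrapper.

```python
DEFAULT_CHARSET = "utf-8"  # stand-in for the Django setting (assumption)


def to_unicode(value, charset=DEFAULT_CHARSET):
    """Decode bytestrings to text; leave text and non-string values alone."""
    if isinstance(value, bytes):
        return value.decode(charset)
    return value


def prepare_params(params, charset=DEFAULT_CHARSET):
    """Normalize query parameters so the backend only ever sees text."""
    return tuple(to_unicode(p, charset) for p in params)
```

For example, `prepare_params(("€", "€".encode("utf-8"), 7))` hands the backend the same text value twice plus the untouched integer.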

This is different from the proposal in the thread 'Unicode or
Strings in Models', as I'd still accept both forms in the model and
deal with it only when I send it to the database. 'Only unicode in
models' would be a major change with many scattered pieces. My
proposal is for a transition phase, to support piece-wise conversion
to Unicode without breaking everything on the way (as newforms does).

Disadvantage: The backend will probably encode it again to get it
across the wire, to either UTF-8 or settings.DEFAULT_CHARSET (or
something else), adding overhead to the database communication.

I think this is a necessary transition from bytestrings to the Great
Unicodification of Everything. As soon as there's unicode
everywhere, the code that deals with bytestrings can be removed and
the solution will fit in perfectly.


What do you think?

Michael


-- 
noris network AG - Deutschherrnstraße 15-19 - D-90429 Nürnberg -
Tel +49-911-9352-0 - Fax +49-911-9352-100

http://www.noris.de - The IT-Outsourcing Company




Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-01-30 Thread Ivan Sagalaev

Bill de hOra wrote:
> Yep; it's a problem on the way back in. Python won't let you interpolate 
> encoded bytestrings and unicode; you have to state the encoding. Ivan - 
> could the db encoding be declared in settings.py?

This is what #952 is about. Though it doesn't convert things for DB on 
Django side, it declares Django's data encoding to DB instead so it can 
convert.




Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-01-30 Thread Bill de hOra

Ivan Sagalaev wrote:
> Michael Radziej wrote:
>>
>> I don't see a problem with the generic views since they pass bytestrings
>> to the database wrapper, this gets as bytestrings to MySQLdb, and for
>> bytestrings the charset attribute is not used at all.
> 
> Umm... This is the exact problem with byte strings: that they require
> knowledge of a charset somewhere.

Yep; it's a problem on the way back in. Python won't let you interpolate 
encoded bytestrings and unicode; you have to state the encoding. Ivan - 
could the db encoding be declared in settings.py?
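
A small illustration of the point, in Python 3 terms. In the Python 2 of this thread, mixing the two triggers an implicit ASCII decode that fails on non-ASCII bytes; Python 3 refuses the mix outright with a TypeError. The variable names are made up for the example.

```python
# Bytes as a bytestring-based backend might return them.
name_from_db = "Bjørn".encode("iso-8859-1")
greeting = "Hello, "

try:
    greeting + name_from_db  # text + bytes is not allowed
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Stating the encoding makes the combination well defined.
combined = greeting + name_from_db.decode("iso-8859-1")
```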

cheers
Bill




Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-01-30 Thread Ivan Sagalaev

Michael Radziej wrote:
> I thank you for all your patience with me. I was completely off-track. I
> read all the mails again, and everything is starting to make sense now.

Then I hope not to confuse you (and everyone else) with my answer :-)

> First, contrary to my former opinion, #3370 is a bug in the newforms
> module, as it is passing unicode to the database API which is not ripe
> for it and will break as soon as you leave ASCII.

I wouldn't call it a bug. Newforms are intended to work in unicode. They 
don't play nice with db backends now but it's a question what should be 
changed: newforms to supply byte strings or db backends to accept unicode.

> I see three ways to fix the problem in #3370:
> 
> a) newforms stops passing unicode strings to the Database API and uses
> bytestrings.
> 
> b) the database wrapper in Django sets connection.charset (but needs to
> translate the charset name since the databases don't understand all
> charset name variants, see ticket #952 here). This is the approach of
> the patches in tickets #1356 and #3370.
> 
> c) the database wrapper in Django must check whether it gets unicode. In
> this case, it needs to encode it into a bytestring.

I believe option a) and b) together will do the work.

Now we have all these confusing bugs because db backends receive two 
kinds of input: unicode from newforms and byte strings from oldforms (the 
majority of existing code, I think). Newforms are now "guilty" of 
introducing unicode to the party, so I think it's better to keep all the 
conversions there.

Option b) is needed because a db backend should know in which 
single-byte encoding it receives data. The great advantage of unicode is 
that you don't have to supply a text's language alongside it; it's encoded 
right there. But with byte strings it's necessary.

Option c) scares me :-). The need to work with byte strings (and hence 
for options a) and b)) remains, but it also introduces the ability to 
accept unicode objects without ever issuing them. I don't think people 
would thank us for this :-)

> With all three variants, what encoding should be used? We currently
> issue (without #952) a 'SET NAMES utf8' at the beginning of each
> connection, so the database server expects to receive utf8. So,
> shouldn't we currently always use utf8 encoding, regardless of what is
> in settings.DEFAULT_CHARSET?

No we shouldn't. In fact this was never working properly, #952 is an old 
bug. It kinda works most of the time because the default value of 
DEFAULT_CHARSET is 'utf-8' and most apps don't change it. But if they do 
and actually work with non-UTF-8 data, then feeding it into a database 
declared as utf-8 will break, because text in an arbitrary single-byte 
encoding is generally not well-formed UTF-8.
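
This is easy to check. The sketch below uses a hypothetical helper, `is_well_formed_utf8`, to test whether a byte sequence decodes as UTF-8. Note that pure-ASCII text would pass in either encoding, so the problem only shows up once non-ASCII characters are involved.

```python
def is_well_formed_utf8(data):
    """Return True if the bytes can be decoded as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same word, encoded once in a single-byte charset and once in UTF-8.
legacy_bytes = "Привет".encode("windows-1251")
utf8_bytes = "Привет".encode("utf-8")
```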

Databases react differently: Postgres complains that it's not utf-8 and 
refuses to accept garbage (I love Postgres :-) ). MySQL, at least some 
versions, just won't check the encoding and store data as a byte array. 
Sorting and case insensitivity won't work but at least you can SELECT 
everything back unchanged which supports the notion that it "works" :-). 
Actually this means that #3370 is safe to include because it's 
MySQL-only, doesn't affect byte strings at all because of MySQL's 
liberal interface and actually fixes a bug when it receives unicode from 
newforms. I'm against it only because it creates this incomprehensible 
mess of conventions and edge cases neutralizing each other... #952 is 
just a more general way of doing things.

> Well, the current patch in #3370 (I still ignore __repr__) only changes
> the charset attribute of a connection, and this attribute is used only
> to encode unicode strings when sending data to the database, or to
> decode bytestrings received from the database when MySQLdb is configured
> to produce unicode ('use_unicode').

BTW I'm -1 on switching backends to unicode right now because:

1. We would have to decode/encode manually for backends that can't do it 
(say, psycopg1)

2. We would immediately get __str__'s returning unicode objects, which will 
open a can of worms of confusion (and flame wars :-) ).

> I don't see a problem with the generic views since they pass bytestrings
> to the database wrapper, this gets as bytestrings to MySQLdb, and for
> bytestrings the charset attribute is not used at all.

Umm... This is the exact problem with byte strings: that they require 
knowledge of a charset somewhere.




Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-01-29 Thread Michael Radziej

Hi there,

I thank you for all your patience with me. I was completely off-track. I
read all the mails again, and everything is starting to make sense now.
This is going to be a lengthy email about #1356 and #3370, but please do
read until the end. Short executive summary: It's really a bug, and the
patch is not bad, but incomplete.

First, contrary to my former opinion, #3370 is a bug in the newforms
module, as it is passing unicode to the database API which is not ripe
for it and will break as soon as you leave ASCII. #3370 is independent
of #952.


I see three ways to fix the problem in #3370:

a) newforms stops passing unicode strings to the Database API and uses
bytestrings.

b) the database wrapper in Django sets connection.charset (but needs to
translate the charset name since the databases don't understand all
charset name variants, see ticket #952 here). This is the approach of
the patches in tickets #1356 and #3370.

c) the database wrapper in Django must check whether it gets unicode. In
this case, it needs to encode it into a bytestring.


With all three variants, what encoding should be used? We currently
issue (without #952) a 'SET NAMES utf8' at the beginning of each
connection, so the database server expects to receive utf8. So,
shouldn't we currently always use utf8 encoding, regardless of what is
in settings.DEFAULT_CHARSET? This point has caused a lot of confusion.

Ivan wrote:

> I'm -1 on setting MySQL connection to 'utf8' in #3370. It *will* make
> sense when we will have newforms ready and models containing unicode.
> But now most of Django is a byte string country. A bright example are
> generic views that take data from web and store it to models without any
> conversions. This patch will feed 'windows-1251' or 'iso-8859-1' to
> MySQL saying that "it's utf-8" and MySQL will try to convert it and most
> certainly will store just strings of ''.

Well, the current patch in #3370 (I still ignore __repr__) only changes
the charset attribute of a connection, and this attribute is used only
to encode unicode strings when sending data to the database, or to
decode bytestrings received from the database when MySQLdb is configured
to produce unicode ('use_unicode'). Here's what the documentation in
MySQLdb-1.2.2b2 says:

 use_unicode
       If True, CHAR and VARCHAR and TEXT columns are returned as
       Unicode strings, using the configured character set. It is
       best to set the default encoding in the server
       configuration, or client configuration (read with
 ==>   read_default_file).  If you change the character set after
 ==>   connecting (MySQL-4.1 and later), you'll need to put the
 ==>   correct character set name in connection.charset.

       If False, text-like columns are returned as normal strings,
       but you can always write Unicode strings.

       *This must be a keyword parameter.*

(But, the charset parameter is also used when you pass in unicode
without setting use_unicode)

python-MySQLdb-1.2.1p2 is similar, except that there it is not a keyword
parameter. There's an interesting difference between 1.2.1p2 and
1.2.2b2: For 1.2.1p2, you have to change the charset attribute of the
existing connection. If you try this on 1.2.2b2, it won't work. For
1.2.2b2, you either have to pass a 'charset' parameter when you create
the connection, or you can call a method set_character_set(). Both of
these won't work for 1.2.1p2, of course :-(

So, the APIs of python-MySQLdb are incompatible with each other (within
a minor version change!) This explains the differences between #1356 and
#3370. We need a patch that plays well with both versions of python-MySQLdb.
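
One possible shape for such a patch is feature detection rather than version checks. The sketch below relies only on the behaviour described in this mail and has not been tested against either MySQLdb release. `set_connection_charset` is a hypothetical helper, and the two connection classes are stubs so the example runs without MySQLdb installed.

```python
def set_connection_charset(connection, charset):
    """Set the charset using whichever mechanism the connection offers."""
    setter = getattr(connection, "set_character_set", None)
    if callable(setter):
        # Newer style, as described for 1.2.2b2.
        setter(charset)
    else:
        # Older style, as described for 1.2.1p2: assign the attribute.
        connection.charset = charset


class OldStyleConnection:
    """Stub mimicking a connection that only has a charset attribute."""
    charset = "latin1"


class NewStyleConnection:
    """Stub mimicking a connection that offers set_character_set()."""

    def __init__(self):
        self.calls = []

    def set_character_set(self, charset):
        self.calls.append(charset)
```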

I don't see a problem with the generic views since they pass bytestrings
to the database wrapper, this gets as bytestrings to MySQLdb, and for
bytestrings the charset attribute is not used at all.

Of course, as soon as #952 has been applied, we need to use the encoding
from settings.DEFAULT_CHARSET.


Michael


P.S.:

If you set the charset parameter in 1.2.2b2's Connection.__init__(), the
default for use_unicode will be True, and python-MySQLdb will return
unicode strings.






Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-01-28 Thread Ivan Sagalaev

[EMAIL PROTECTED] wrote:
> I think the next step in the unicodeification of django is to decide where 
> the conversions happen. Or has this already been decided?
> 
> I like the picture of "unicode circle of trust": everything inside the circle 
> is trusted as unicode strings. Everything outside has to be encoded/decoded.
> It's pretty clear the database is outside, the http gets/posts are outside 
> too. But what about templates? What about settings/views/models?
> I guess if that is decided, we can have a "unicode roadmap". I guess there 
> are a few people who have spare time and knowledge to help django become 
> unicode.

Yes, this was discussed and resolved pretty much as you described:
everything is in unicode except the Web and the database. The roadmap is
here: http://code.djangoproject.com/wiki/UnicodeInDjango
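
The "circle of trust" can be sketched roughly as follows. This is an
illustration only, written for Python 3 (an assumption; there `str` plays
the role of this thread's unicode objects and `bytes` the role of byte
strings). handle_request is a hypothetical function, not Django code.

```python
def handle_request(raw_body: bytes, web_charset: str) -> bytes:
    # Boundary in: bytes from the web become text.
    text = raw_body.decode(web_charset)
    # Inside the circle: plain text operations, no charset in sight.
    greeting = "Hello, " + text.strip().title()
    # Boundary out: text becomes bytes again for the response.
    return greeting.encode(web_charset)


body = "  жанна ".encode("windows-1251")
response = handle_request(body, "windows-1251")
```

The charset name appears only at the two boundaries; the code in between
never needs to know which legacy encoding the site uses.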




Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-01-28 Thread Ivan Sagalaev

Michael Radziej wrote:
> Hey, I now finally understand why you need #952 as soon as you switch to
> a different charset. I understand your point, but I'd rather offer a
> solution than postponing this for such a long time.

+1

#952 is good to include now since it plays nicely with the byte string
models that we have now.

The newforms issue, that their data can't be automatically dumped to
models, is also really a separate thing.




Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-01-28 Thread Michael Radziej

Hi,

Ivan Sagalaev schrieb:
> Michael Radziej wrote:
>> I'm not sure about what the last sentence means--are you suggesting to
>> put #3370 (the mysql part) into "Needs design decision"?
> 
> ## 3370
> 
> I'm -1 on setting MySQL connection to 'utf8' in #3370. It *will* make
> sense when we have newforms ready and models containing unicode.
> But for now most of Django is byte string country. A prime example is the
> generic views, which take data from the web and store it in models without
> any conversion. This patch will feed 'windows-1251' or 'iso-8859-1' to
> MySQL saying that "it's utf-8", and MySQL will try to convert it and most
> certainly will store just strings of '?'. The patch works for
> the author only because it feeds newforms' unicode objects right into
> models, which is wrong (we haven't converted models to unicode yet).

Ah, I see. Somehow it wasn't clear to me that POSTs and GETs are just
passed along, but now that you mention it, it looks so obvious. Thanks,
that was the missing piece that kept me from proper understanding.

> But the __repr__ part is plain incorrect:

Now, let's set __repr__() aside, it's a different issue. We can come
back to it later.

> ## 952
> 
> This patch tries to set connection encoding to the one used for web: 
> DEFAULT_CHARSET. But when we convert Django to unicode (we'll have to do 
> it anyway because of newforms) this won't be necessary because models 
> will be unicodified too. Then it'll make sense to set 'utf8' in all 
> backends as a connection encoding.

Hey, I now finally understand why you need #952 as soon as you switch to
a different charset. I understand your point, but I'd rather offer a
solution than postponing this for such a long time.

> ## Suggestion
> 
> Now I think we should close all these bugs. Don't laugh (or cry)! #952 
> is neither long-term nor helps ak's case, #3370 is broken (sorry, ak, 
> but it is) and #1356 is a dupe of #3370.

I agree with closing #1356 and #3370, but #952 seems to be valuable
independent of ak's case.

I'd rather put #952 into "Needs design decision", because that's really
the realm of the core to decide, but it looks a bit as if Adrian has
already accepted it (as he reopened it). SmileyChris, you did the
initial triage on #952, do you read me? What's your opinion here?

As I said, this leaves __repr__() aside. Let's tackle it after the
connection encoding.

Michael





Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-01-28 Thread [EMAIL PROTECTED]

Ok, thanks for that Ivan,

Michael - ignore what I said before :-).

The real question, then, is what will it take to get Django unicode 
uh, "safe" (not sure if that's the best term) before 1.0. I realise 
that this looks like it's going to be fairly major to sort out, but if 
we don't then we're going to have all sorts of irritating little bugs 
like these ones popping up repeatedly.

--Simon





Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-01-28 Thread Ivan Sagalaev

Michael Radziej wrote:
> I'm not sure about what the last sentence means--are you suggesting to
> put #3370 (the mysql part) into "Needs design decision"?

## 3370

I'm -1 on setting MySQL connection to 'utf8' in #3370. It *will* make
sense when we have newforms ready and models containing unicode.
But for now most of Django is byte string country. A prime example is the
generic views, which take data from the web and store it in models without
any conversion. This patch will feed 'windows-1251' or 'iso-8859-1' to
MySQL saying that "it's utf-8", and MySQL will try to convert it and most
certainly will store just strings of '?'. The patch works for
the author only because it feeds newforms' unicode objects right into
models, which is wrong (we haven't converted models to unicode yet).
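
To see why mislabelled bytes are a problem, here is a small sketch in
Python 3 (an assumption; `bytes` corresponds to the byte strings discussed
here). Python's own utf-8 decoder stands in for the database's conversion
step, so this shows only that the bytes are not valid utf-8, not what MySQL
actually ends up storing.

```python
text = "Привет"
# What a windows-1251 site would pass along, untouched, from a POST.
legacy_bytes = text.encode("windows-1251")

# Claiming that these bytes are utf-8 does not make them utf-8.
try:
    legacy_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient converter has to substitute placeholder characters instead.
lossy = legacy_bytes.decode("utf-8", errors="replace")
```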

But the __repr__ part is plain incorrect:

    try:
        return '<%s: %s>' % (self.__class__.__name__, self)
    except UnicodeEncodeError:
        return '<%s: %s>' % (self.__class__.__name__,
            self.__str__().encode(settings.DEFAULT_CHARSET))

The __str__().encode(...) is wrong because it's already 'str' and you 
can't encode it any further.

It worked for the patch author because he had the __str__ of a model
returning a unicode object. That's wrong and it should be fixed after the
whole unicodification of Django. But patching it this way will break
perfectly normal code where people don't assign unicode objects to model
properties. Granted, the breakage won't be very bad because people don't
often show __repr__ to users. But it's still bad.
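
A rough illustration of the breakage, written for Python 3 (an assumption).
Python 3 byte strings have no encode() at all, so the hypothetical helper
py2_style_encode spells out what Python 2 does when encode() is called on a
byte string: an implicit ASCII decode first, then the requested encode.

```python
def py2_style_encode(byte_string: bytes, charset: str) -> bytes:
    """Mimic Python 2's str.encode(): implicit ASCII decode, then encode."""
    return byte_string.decode("ascii").encode(charset)


ascii_name = b"Article"
cyrillic_name = "Статья".encode("windows-1251")

# Pure ASCII survives the round trip unchanged.
ok = py2_style_encode(ascii_name, "windows-1251")

# A byte string that is already in a national charset does not.
try:
    py2_style_encode(cyrillic_name, "windows-1251")
    failed = False
except UnicodeDecodeError:
    failed = True
```

So a __str__ that returns an ordinary byte string in a national charset is
exactly the case that the extra encode() call cannot handle.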

## 952

This patch tries to set connection encoding to the one used for web: 
DEFAULT_CHARSET. But when we convert Django to unicode (we'll have to do 
it anyway because of newforms) this won't be necessary because models 
will be unicodified too. Then it'll make sense to set 'utf8' in all 
backends as a connection encoding.

## Suggestion

Now I think we should close all these bugs. Don't laugh (or cry)! #952 
is neither long-term nor helps ak's case, #3370 is broken (sorry, ak, 
but it is) and #1356 is a dupe of #3370.




Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-01-28 Thread Ivan Sagalaev

ak wrote:
> Bjorn, if you read my first messages and specially my patch #3370, you 
> find that I made a suggestion that if the guys want to move to unicode 
> they better drop all native encodings support and so does my patch.

With all due respect, you seem to not understand this. 'Unicode' does 
not mean 'dropping native encodings support'. This is just FUD.

Your patch in #3370 is broken (as I showed you in a personal mail)
because it 'encodes' __str__, which works only for your special case
where you assign a unicode object to a model property and return it from
__str__.




Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-01-28 Thread Michael Radziej

Hi Simon,

[EMAIL PROTECTED]:
> +1
> 
> I was just coming to the same conclusion - #952 looks good to go, and 
> #3370 could be split into the __repr__ and mysql issues. __repr__ and 
> #952 are easy to solve. The rest of it needs the cores to come to a 
> decision about this.

I'm not sure about what the last sentence means--are you suggesting to
put #3370 (the mysql part) into "Needs design decision"?

Michael






Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-01-28 Thread [EMAIL PROTECTED]

+1

I was just coming to the same conclusion - #952 looks good to go, and 
#3370 could be split into the __repr__ and mysql issues. __repr__ and 
#952 are easy to solve. The rest of it needs the cores to come to a 
decision about this.

--Simon





Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-01-27 Thread Michael Radziej

Hi,

ak schrieb:
> After some thought I came to the following conclusion: if you guys
> want to keep support for legacy charsets, you in fact don't have to
> force model objects to be unicode. Firstly, they are passed to
> templates and filters, and we can't mix legacy charsets with unicode in
> one template. Next, if I don't use unicode, I don't have to write my
> python sources (views) in unicode. So, I need to be able to pass
> string values into my model objects, and my strings are not unicode.
>
> So if everyone agrees, the way is simple:
> 1. when django loads data from db and fills in a model object, all
> strings have to be encoded according to DEFAULT_CHARSET
> 2. when django passes data from form object to model object, it has to
> encode strings according to DEFAULT_CHARSET again

This thread is moving further and further away from the tickets. I started
it to get some help in deciding how to proceed with these ...

Regarding ak's proposal, this is going against a widely shared agreement
within the python world that applications should internally use unicode
strings (not: utf8 strings) and decode/encode to a bytestring at the
boundaries, which is usually input/output, or for database applications
it's the communication between the database backend (e.g. MySQLdb) and
the database. I'm not in a position to make any decisions for django,
but I'm pretty sure that you cannot convince the core developers to
follow your path.

Down to earth and back to tickets, my current understanding is this:

The problem that started the original thread in django-users was that
the MySQLdb backend thought it was using latin-1 encoding for the
connection and therefore could not encode '€', which is in iso-8859-15
but not in iso-8859-1 aka iso-latin-1. Ticket #2896 seems to explain how
this can happen.
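
The difference between the two encodings is easy to check. The sketch below
uses Python 3 (an assumption; the codec names are the standard ones):

```python
euro = "\u20ac"  # the Euro sign

# iso-8859-15 (latin-9) has a slot for it ...
latin9 = euro.encode("iso-8859-15")

# ... iso-8859-1 (latin-1) does not.
try:
    euro.encode("iso-8859-1")
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False

# utf-8 can represent it, using three bytes.
utf8 = euro.encode("utf-8")
```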

In my opinion, each of the three tickets in the subject should solve
this issue, and none tries to cope with templates written in a different
encoding than settings.DEFAULT_CHARSET.

#952 allows using a different encoding on the connection than
settings.DEFAULT_CHARSET. It does so for all backends.

#1356 sets connection.charset in the mysql backend to utf8. This makes
MySQLdb use utf8 encoding, but it's hackish and has been reported
not to work in all environments.

#3370 opens the mysql backend connection with charset='utf8', which
seems a cleaner way to do the same as #1356. It also fixes the __repr__
of models (not sure if this is the best way, but this can be added to
any of the other patches).

My bottom line is that #952 has a different scope than the other two
tickets, and that #1356 should be closed as a duplicate of #3370. #3370
and #952 can co-exist.


So, would anybody object to closing #1356 and promoting #952 and
#3370 to "Accepted" (which was their state before we started this
discussion)?

Michael





Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-01-27 Thread Bjørn Stabell

On Jan 28, 2:02 pm, "ak" <[EMAIL PROTECTED]> wrote:
> Bjorn, if you read my first messages and specially my patch #3370, you
> find that I made a suggestion that if the guys want to move to unicode
> they better drop all native encodings support and so does my patch.

You mean require all I/O edge/boundary points to convert to/from 
Python unicode strings?  (We'll of course need to support non-UTF 
character encodings in databases, files, the web, etc.)

> Then people started to answer me that this is wrong. And at the moment
> no one is able to explain the whole thing and answer my questions:
> 1. how do they want to support templates and python code (views/
> scripts) in native encodings if django itself would be all in unicode.
> The only way i see is to encode/decode everything at programmer's end
> and this means for me no native encodings support at all.

Support for Unicode strings (u"") in code is described in PEP-263, 
e.g.,

  #!/usr/bin/python
  # -*- coding: <encoding name> -*-

Unfortunately it's not implemented yet (AFAIK), so you can't just have 
unescaped literals:

  s = u"encoded text goes here"  # doesn't work yet; pending PEP-263

An alternative for literals in code is to surround them with unicode() 
and specify the appropriate encoding:

  s = unicode("encoded text goes here", "encoding name")
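
For comparison, a sketch of both approaches on an interpreter that honours
the PEP 263 declaration. Python 3 is assumed here, which differs from the
Python of this thread: unicode() is spelled str() there. The file name
legacy_module.py and the sample text are made up for the example.

```python
import os
import runpy
import tempfile

# A module whose source bytes are windows-1251, declared in its first line.
source = '# -*- coding: windows-1251 -*-\ns = u"привет"\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "legacy_module.py")
    with open(path, "wb") as handle:
        handle.write(source.encode("windows-1251"))
    # The interpreter reads the declaration and decodes the literal itself.
    namespace = runpy.run_path(path)

declared = namespace["s"]

# The explicit route: hand over the bytes together with the encoding name.
explicit = str("привет".encode("windows-1251"), "windows-1251")
```

Either way the encoding is stated once, at the point where the bytes enter
the program.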

An even better way is to externalize all strings in .po files and use 
gettext, which has some support for returning unicode strings.


I guess templates could have their character encoding identified 
either through a similar mechanism, through a global settings 
variable, or just use the system default encoding.


> 2. how do they want to support legacy databases if db connection speaks 
> unicode

I'm not sure I can follow you.  How to configure a database adapter 
depends on the database and adapter you're using.  Some can accept 
unicode strings; for those that don't I guess you'll need a wrapper of 
some sort.


Rgds,
Bjorn





Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-01-27 Thread ak

After some thought I came to the following conclusion: if you guys
want to keep support for legacy charsets, you in fact don't have to
force model objects to be unicode. Firstly, they are passed to
templates and filters, and we can't mix legacy charsets with unicode in
one template. Next, if I don't use unicode, I don't have to write my
python sources (views) in unicode. So, I need to be able to pass
string values into my model objects, and my strings are not unicode.

So if everyone agrees, the way is simple:
1. when django loads data from db and fills in a model object, all
strings have to be encoded according to DEFAULT_CHARSET
2. when django passes data from form object to model object, it has to
encode strings according to DEFAULT_CHARSET again

In fact, my patch #3370 is wrong then; actually the newforms.model.save()
method should be patched to recode clean_data from unicode to
DEFAULT_CHARSET (if it differs) when passing this data to the model object,
and then we would have everything in place: utf8-based templates and
legacy-charset-based templates would both be correctly supported, and
any national characters would be stored in the db perfectly, as they are now
with oldforms (of course, remember what I said about #952).
The second required patch is about recoding unicode strings loaded
from the db to DEFAULT_CHARSET (if it differs) when passing them to model
objects, and back from DEFAULT_CHARSET to unicode when we save model
objects to the db. This patch will solve the #952 issue, and again it will
work with both unicode and legacy-charset-based templates.
And even more: if we have a legacy database which doesn't
understand unicode, we can detect this fact immediately after
connecting to the db and decide the correct way to decode/encode strings.
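
The first of these two patches would amount to something like the sketch
below. It is written for Python 3 (an assumption; `str` corresponds to the
unicode objects that newforms produces), and encode_clean_data is a
hypothetical helper, not an existing Django function.

```python
def encode_clean_data(clean_data, charset):
    """Return a copy of clean_data with text values encoded to charset.

    Values that are not text (numbers, dates, None, ...) pass through as-is.
    """
    encoded = {}
    for name, value in clean_data.items():
        if isinstance(value, str):
            encoded[name] = value.encode(charset)
        else:
            encoded[name] = value
    return encoded


form_data = {"title": "Цена", "price": 5}
model_values = encode_clean_data(form_data, "windows-1251")
```

Checking the type avoids a blanket try/except around encode(), which could
hide real encoding errors along with the "not a string" case.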

As I see it, this approach fixes all unicode/charset issues and answers all
questions. So, if there are no objections, I can write this patch
tomorrow or by Monday.





Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-01-27 Thread Ivan Sagalaev

Michael Radziej wrote:
> 1. Are all these tickets really about the connection encoding?
> 
> 2. If so, what's the problem of using utf8 for the connection for
> everybody? I don't see how this would be a problem for anybody who is
> using a different encoding for templates, within the database's storage
> or else, since there's no loss in converting anything into utf8. Or is
> there?

I agree with the 2nd point. You can still run into a theoretical problem
with it in a scenario where the input is richer than the storage:

- a database that internally stores data in a legacy encoding (say
iso-8859-1)
- a web frontend that talks utf-8
- a user enters, say, Russian characters into a form
- data travels as utf-8 right up to the db, which will fail to encode it
in iso-8859-1 because that has no place for Russian characters

But it's indeed a very theoretical case. Most legacy systems use the same
legacy encoding for both backend and frontend, and there would be no
errors in the path: legacy (web) - unicode (newforms) - utf-8 (db
connection) - legacy (db)
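
Both cases can be sketched in a few lines of Python 3 (an assumption). The
store() function is a hypothetical stand-in for the database converting
incoming text to its storage encoding; no real database is involved.

```python
def store(text: str, storage_charset: str) -> bytes:
    """Stand-in for the db converting incoming text to its storage charset."""
    return text.encode(storage_charset)


# The usual path: legacy (web) - unicode - utf-8 (connection) - legacy (db)
web_bytes = "Привет".encode("windows-1251")
text = web_bytes.decode("windows-1251")
wire = text.encode("utf-8")
stored = store(wire.decode("utf-8"), "windows-1251")

# The theoretical problem: the input is richer than the storage.
try:
    store("Привет", "iso-8859-1")
    fits = True
except UnicodeEncodeError:
    fits = False
```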




Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-01-27 Thread Ivan Sagalaev

ak wrote:
> Could someone please explain to me what the problem was with unicode
> support in oldforms, such that newforms was made with unicode inside?

I can! The thing is it has absolutely nothing to do with forms, it's 
just historical coincidence.

Originally Django was written using byte strings everywhere and
there was no such thing as a "conversion problem". However, there were
problems with incorrect string operations on byte strings (maxlength
counting, upper/lower casing, etc.). Some time ago there was a decision
to convert Django to work internally with unicode strings and convert
them into byte strings at the boundaries to the web and to the database. And
there was no such thing as newforms at that moment.

And then Adrian started to implement newforms and he chose to make
its internals unicode, for compatibility with Django's future as I
understand it.

> Kick me if I'm wrong, but what is the real reason to convert bytes back
> and forth? Religion?

Reasons are purely technical... I'll list them but please do read until 
the end of the letter before you disagree. I believe you just 
misunderstand some things about unicode.

1. Unicode is a universal encoding that can store all characters. 
Without universal encoding an app written by a Russian programmer 
wouldn't be able to use a library written by a French programmer. This 
is why we need unicode.

2. In Python unicode strings can be either 'unicode' objects or
byte-strings encoded in utf-8. The problem with utf-8 is that you can't
do string operations on it. For example, you can't cut a month's name to 3
letters just by doing month[0:3], because letters can occupy different
numbers of bytes. This is what unicode objects are for and why Django
should work with unicode internally.
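
The month example, sketched in Python 3 (an assumption; `str` plays the role
of the unicode object and `bytes` that of the utf-8 byte string):

```python
month = "января"                  # "January" in Russian (genitive), 6 letters
as_utf8 = month.encode("utf-8")   # the same text as a utf-8 byte string

# Slicing the bytes cuts the second letter in half.
byte_slice = as_utf8[0:3]
try:
    byte_slice.decode("utf-8")
    clean_cut = True
except UnicodeDecodeError:
    clean_cut = False

# Slicing the text object gives three letters, as intended.
text_slice = month[0:3]
```

The same mismatch between byte count and letter count affects maxlength
checks on byte strings.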

May I recommend my post about unicode and bytes (it's in Russian):
http://softwaremaniacs.org/blog/2006/07/28/unicode-and-bytes/

> I agree with everyone who says that unicode is a
> must and 'legacy' charsets are crap but guys I already have a BIG
> application that was about 80% migrated from other python frameworks to
> django some time ago and for legacy reasons it was all in national
> charset, not unicode.

What gives you the idea that Django won't work with this data? All this
unicode stuff is purely internal. If you want your app to output
windows-1251, set DEFAULT_CHARSET to windows-1251 and data will be
automatically converted from and to it. I believe even newforms already
uses this setting to convert unicode data for templates (if not, it should
just be fixed, and I'm happy to make a patch since I've got some free time).

> Then I found that oldforms support will be
> dropped sooner or later. So we here decided to start moving (yes,
> moving again !!!) all our code to newforms, and what did we get? We found
> that we now have to recode everything to utf-8

Surely not :-). I'd say it would be a wise thing to do *eventually*. But for
now you absolutely can keep your templates and python sources in
windows-1251.

> Did anyone who used unicode with oldforms have any problems? I am sure
> no one did.

In fact nobody used unicode with old forms. All things in request.POST, 
manipulator.flatten_data and in db models were always in byte strings 
(except db models with psycopg2).

And there were problems with it. They were just fixed very early (a 
couple of them by yours truly).

> So guys, please explain to me what the reason was for making me migrate to
> unicode.

I still think that you're confusing migrating Django internals to 
unicode objects and converting your files to utf-8. It's not about the 
latter.

> My opinion is simple: let's decide once whether django is for unicode only
> or django supports both unicode and national charsets, and then let's work.

Sure Django does and will support national charsets. This is why we have 
DEFAULT_CHARSET setting. Internal unicode just lets Django have all the 
encode/decode stuff localized in two places instead of littered all over 
the code.




Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-01-27 Thread ak

Michael, if you read the topic about the euro sign in newforms again, you
will find that this touches everything. Personally I couldn't find a way
to use utf-8 to connect to MySQL and keep using cp1251 in my templates: it
basically doesn't work. With my patch (#3370) and utf8 everywhere it
does.





Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-01-27 Thread ak

Guys

Could someone please explain to me what the problem was with unicode
support in oldforms, such that newforms was made with unicode inside?
Kick me if I'm wrong, but what is the real reason to convert bytes back
and forth? Religion? I agree with everyone who says that unicode is a
must and 'legacy' charsets are crap, but guys, I already have a BIG
application that was about 80% migrated from other python frameworks to
django some time ago, and for legacy reasons it was all in a national
charset, not unicode. Then I found that oldforms support will be
dropped sooner or later. So we here decided to start moving (yes,
moving again !!!) all our code to newforms, and what did we get? We found
that we now have to recode everything to utf-8 and search for bugs in
more than 10k lines of our oldforms-based code until we move everything
to newforms and utf-8. But really, why?
Did anyone who used unicode with oldforms have any problems? I am sure
no one did.
Did anyone who used native encodings with oldforms have any problems
(except for the patch against one line of code I described before, or #952)?
No one did.

So guys, please explain to me what the reason was for making me migrate to
unicode.

Django is a web framework for perfectionists with deadlines. I see many
perfectionists here, but what about deadlines?

My opinion is simple: let's decide once whether django is for unicode only
or django supports both unicode and national charsets, and then let's work.
If you tell me that from now on there is only a "unicode future", I'd agree
and start searching for bugs and sending patches like #3370.





Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-01-26 Thread Julian 'Julik' Tarkhanov


On Jan 26, 2007, at 11:47 AM, Michael Radziej wrote:

> # 1356 sets the charset attribute of the mysql backend connection to
> 'utf8' for mysql version >= 4.1

And leaves everyone who wants to operate in 8 bits out in the cold.  
Where they actually ought to be anyway, but I tried to stay liberal  
in 952 - primarily because
it's still unknown how Django authors want to approach this.


-- 
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl






Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-01-26 Thread Julian 'Julik' Tarkhanov


On Jan 26, 2007, at 2:25 PM, Gábor Farkas wrote:

>
> Julian 'Julik' Tarkhanov wrote:
>>
>>
>> Python's unicode is actually UTF-16
>
> on linux it's usually utf-32, and on windows it's usually (always?)  
> utf-16.
sorry I forgot that - it's been a year at least since I last touched  
Python (actually it was
for the Django test drive)
>
> but you should not care about it. you see, in python,
> the unicode-strings are a separate data-type, and there's
> just no way to take a bytestring, and tell python: "from now on,
> you are an unicode-string, because i know that you are encoded in  
> utf-16."
segregating ustrings and strings is BBD, been telling it for years.
The latest I heard is that the next major Py will abolish bytestrings
for good.

Getting back to the issue that we were on, I am still strongly advocating
the "don't go there" approach for anything but Unicode. How it should be
handled in relation to source code is unknown to me (AFAIK Python has a
preamble sort of declaration that you can actually use to tell the
interpreter which encoding your source is in). I just know you hit some
major pain when you expect ustrings and get bytestrings instead (and in
Python, just as in Perl, only about 30% of the libraries actually care
about what they give you).

> so while it might be, that the conversion from utf-16-bytestrings to
> unicode is sometimes faster than converting from utf-8-bytestrings to
> unicode, you can't be sure, because as i wrote above, the internal
> unicode-encoding is not fixed.
>
>> whereas IO and the databases mostly
>> speak UTF-8 -
>> so no, you can't dump it over the wire.
>
>> We Rubyists are a tad happier
>> because we now
>> have all in UTF-8
>
> you mean that regexes, and all the methods of the string-class now are
> unicode-aware in ruby? :)

Regexes have been unicode-aware for some time already, except for case
sensitivity and the class repertoire (which will be fixed when Oniguruma
is there). As for the string methods, we mostly took care of them with
AS::Multibyte (without silly subclassing) and that works wonders for me.
The greatest advantage is that I never have to check what's coming down
the pipe because there's only one String to rule them all.
-- 
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl






Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-01-26 Thread Gábor Farkas

Julian 'Julik' Tarkhanov wrote:
> 
>  
> Python's unicode is actually UTF-16 


sorry, but no. it's not utf-16.

it's decided at compile-time,
and it's either utf-32 or utf-16.

on linux it's usually utf-32, and on windows it's usually (always?) utf-16.

but you should not care about it. you see, in python,
the unicode-strings are a separate data-type, and there's
just no way to take a bytestring, and tell python: "from now on,
you are a unicode-string, because i know that you are encoded in utf-16."

the way it works is that you take a bytestring,
and ask python to convert it into a unicode-string (and you also have 
to tell python the bytestring's charset).

so while it might be that the conversion from utf-16-bytestrings to 
unicode is sometimes faster than converting from utf-8-bytestrings to 
unicode, you can't be sure, because as i wrote above, the internal 
unicode-encoding is not fixed.
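
A minimal sketch of the conversion described above, written in Python 3 terms as an assumption (the thread predates Python 3; there, `bytes` plays the role of the bytestring and `str` the role of the unicode-string). The sample value is illustrative only:

```python
# The Euro sign (U+20AC) as a UTF-8 bytestring: three bytes.
raw = b"\xe2\x82\xac"

# Decoding requires naming the charset; bytes cannot simply be
# "relabelled" as text.
text = raw.decode("utf-8")

# The same text can be re-encoded into another charset on the way out.
utf16 = text.encode("utf-16-le")

print(len(raw), len(text), len(utf16))  # 3 1 2
```

Whatever layout the interpreter uses internally for `text` never shows up here; only the explicitly named charsets do.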

> whereas IO and the databases mostly
> speak UTF-8 -
> so no, you can't dump it over the wire.

> We Rubyists are a tad happier
> because we now
> have all in UTF-8

you mean that regexes, and all the methods of the string-class now are 
unicode-aware in ruby? :)

gabor




Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-01-26 Thread Ivan Sagalaev

ak wrote:
> Ticket http://code.djangoproject.com/ticket/952 contains a complete
> solution to this problem and I don't know why it was not merged into
> the code, but at the moment it does not matter, and here is the reason why:
> Since the newforms library was born and the decision to use unicode
> for clean_data was made, all these patches became unnecessary

Not at all. Anton, read the summary that I posted as a reply to Michael's 
first post. Specifying the database encoding and keeping internals in 
unicode are two separate issues. #952 is still necessary but not enough 
to fix your bug.

> because
> now developers must use only unicode everywhere (templates, db etc)

Actually they shouldn't :-). Newforms is now the only part of Django that 
works with unicode. I/O with the web (requests and templates) is now 
hotfixed to work with it in a way. Databases aren't.

> or
> manually recode all forms based on newforms from unicode to native
> encoding and back. Of course this is stupid

Maybe it is. But it's a temporary inconvenience of newforms. Later the 
database backend should do this automatically by using either 'utf-8' or 
DATABASE_CHARSET, as I described in that message of mine.

BTW, there were ideas here about really really forcing users to migrate 
all data into unicode/utf-8 and be the first guy on the block that would 
lead the trend. This is noble but hard and if I remember correctly this 
was decided against...

> So, for me the question sounds like this: either newforms doesn't use
> unicode to store clean_data and we can keep using 'legacy' character
> sets, or django needs to drop all charset support except unicode.
> Or it should convert strings back and forth everywhere LOL

Incidentally, your last 'LOL' is the option that Django has chosen :-). 
I'll try to explain.

'Unicode' is not a charset, or, more specifically, it is not represented 
with bytes. Python's native unicode strings represent unicode characters 
in some internal format that just can't be dumped over the wire, be it 
to the database or to the web. Because of this, if Django works 
internally in unicode, it must encode everything it writes and decode 
everything it reads from outside. Converting from unicode to utf-8 is 
also encoding, and it does not happen automatically.

When you say that db backend supports 'unicode' it actually means that 
db library under Django backend does the encoding itself. But whether 
it's done in the library or in Django backend we still need a setting 
for charset. Two settings actually: for the web (that we already have) 
and for db (that is implemented in #952).
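
To illustrate the two-settings point, here is a rough sketch in Python 3 terms. `WEB_CHARSET` and `DB_CHARSET` are hypothetical stand-ins for the web-side and database-side settings, not actual Django names, and the helpers are invented for the example:

```python
import io

WEB_CHARSET = "windows-1251"  # hypothetical: charset of requests and templates
DB_CHARSET = "utf-8"          # hypothetical: charset the database expects


def from_web(raw):
    """Decode incoming bytes into text at the web boundary."""
    return raw.decode(WEB_CHARSET)


def to_db(text):
    """Encode text into bytes at the database boundary."""
    return text.encode(DB_CHARSET)


request_body = "цена 5 €".encode(WEB_CHARSET)  # what the browser sent
text = from_web(request_body)                   # internal unicode text
wire = to_db(text)                              # what goes to the database

# Text cannot be written to a byte sink without an explicit encoding step.
sink = io.BytesIO()
try:
    sink.write(text)
except TypeError:
    wrote_text = False
else:
    wrote_text = True
sink.write(wire)
```

The two byte sequences differ even though they carry the same text, which is why one charset setting cannot serve both sides.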




Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-01-26 Thread Michael Radziej

Hi,

here's a summary of what the different tickets are about:

# 952 adds a database client encoding setting,
DATABASE_CLIENT_CHARSET, for mysql and postgresql backends. For
mysql, it uses the given charset in 'SET NAMES' to build the
connection, except for mysql < 4.1. For postgresql, it does a 'SET
CLIENT_ENCODING TO'.

# 1356 sets the charset attribute of the mysql backend connection to
'utf8' for mysql version >= 4.1

# 3370 starts by explaining a traceback within newforms when you use
utf8-encoded values with a form created by form_for_instance and has
a patch that adds 'charset':'utf8' to the kwargs used in
Database.connect() within DatabaseWrapper.cursor()
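
As a rough illustration of what #952 is described as doing, the statements could be built like this. The function name, its parameters and the charset-name check are assumptions made for this sketch and are not taken from the actual patch:

```python
def client_encoding_sql(backend, charset, server_version=None):
    """Return the SQL announcing the client charset, or None if unsupported."""
    # Hypothetical guard: the value is interpolated into SQL text, so only
    # plain charset names are accepted.
    if not charset.replace("_", "").replace("-", "").isalnum():
        raise ValueError("suspicious charset name: %r" % charset)
    if backend == "mysql":
        # Per the summary above, SET NAMES is skipped for mysql < 4.1.
        if server_version is not None and server_version < (4, 1):
            return None
        return "SET NAMES '%s'" % charset
    if backend == "postgresql":
        return "SET CLIENT_ENCODING TO '%s'" % charset
    raise ValueError("unknown backend: %r" % backend)


print(client_encoding_sql("mysql", "utf8"))       # SET NAMES 'utf8'
print(client_encoding_sql("postgresql", "UTF8"))  # SET CLIENT_ENCODING TO 'UTF8'
```

A backend would run the returned statement once, right after opening the connection.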


Michael Radziej

-- 
noris network AG - Deutschherrnstraße 15-19 - D-90429 Nürnberg -
Tel +49-911-9352-0 - Fax +49-911-9352-100

http://www.noris.de - The IT-Outsourcing Company




Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-01-26 Thread ak

Guys

The problem is simple but it was born a very long time ago.
For MySQL 4.1 and higher, this is hardcoded in
django/db/backends/mysql/base.py:
cursor.execute("SET NAMES 'utf8'")
There were lots of tickets and messages in django-users complaining about
this, but in fact they were all ignored.
Personally, my company used a patched django installation where
this line was replaced with:
cursor.execute("SET NAMES 'cp1251'")
because all our templates were (and still are in the production
environment) in windows-1251 encoding, so we had to use cp1251 to
deal with the db.
Ticket http://code.djangoproject.com/ticket/952 contains a complete
solution to this problem and I don't know why it was not merged into
the code, but at the moment it does not matter, and here is the reason why:
Since the newforms library was born and the decision to use unicode
for clean_data was made, all these patches became unnecessary because
now developers must either use only unicode everywhere (templates, db etc) or
manually recode all forms based on newforms from unicode to the native
encoding and back. Of course this is stupid and no one will do it because
it's easier to migrate to utf-8 and forget about the problem.
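
A small sketch of why the hardcoded 'utf8' hurts a windows-1251 setup, in Python 3 terms (the sample text is made up for illustration):

```python
text = "Привет, €"

# What a windows-1251 template or form would actually hand over as bytes.
stored = text.encode("cp1251")

# Treating those bytes as utf-8, as a connection fixed to 'utf8' would, fails.
try:
    stored.decode("utf-8")
except UnicodeDecodeError:
    utf8_ok = False
else:
    utf8_ok = True

# Decoding with the charset the bytes were really written in round-trips fine.
roundtrip = stored.decode("cp1251")
```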

So, for me the question sounds like this: either newforms doesn't use
unicode to store clean_data and we can keep using 'legacy' character
sets, or django needs to drop all charset support except unicode.
Or it should convert strings back and forth everywhere LOL

Any other opinions?





Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

2007-01-26 Thread Ivan Sagalaev

Michael Radziej wrote:
> Hi,
> 
> we have a bit of chaos here ... Tickets 3370, 1356 and probably 952
> all are about this problem, all are accepted, and #3370 and #1356
> have very similar patches. I ask everybody to continue discussion
> here in django-developers, and I ask the authors of these three
> tickets to work together to find out how to proceed.

Right :-). I'll generalize my comment in #3370 here.

There are, in fact, two separate issues.

1.  The first one (that #952 was intended to fix) is that we don't have a 
notion of a database internal encoding at all. This is bad because the DB is 
as external to Django as the web and it can be in any encoding.

 Then there are two ways of dealing with it:

 - let Django encode data into the charset that the database expects
 - tell the database which encoding Django uses and let it encode
   data into its internals

 #952 is implemented as the second variant and it looks like it works 
(in fact its author is Julian Tarkhanov, a well-known unicode expert 
and advocate in the Russian blogosphere... just giving credit :-) )

 We really should have this thing regardless of Django's unicode or 
byte-string internals.

2. The second issue is the automatic conversion of unicode data for db 
backends that don't understand unicode. It has become relevant recently 
because people started to use newforms. If we accept #952 as it is, then 
this should be resolved by encoding things into 'utf-8' inside backends. 
If we choose to reimplement database encoding support on the Django side, then 
the backend should encode into whatever encoding is stored in the 
DATABASE_CHARSET setting.
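
The second issue could be handled roughly like this inside a backend, in Python 3 terms. `encode_params` is a hypothetical helper and `DATABASE_CHARSET` here is just a module-level stand-in for the proposed setting:

```python
DATABASE_CHARSET = "utf-8"  # stand-in for the proposed setting


def encode_params(params, charset=DATABASE_CHARSET):
    """Encode text parameters for a driver that only accepts bytestrings.

    Non-text values (numbers, None, ...) are passed through untouched.
    """
    return tuple(
        p.encode(charset) if isinstance(p, str) else p
        for p in params
    )


print(encode_params(("€", 5, None)))  # (b'\xe2\x82\xac', 5, None)
```

With #952 as it is, the charset would stay fixed at 'utf-8'; with Django-side encoding support it would come from the setting.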

This is what things are like now.
