Re: django unicode-conversion, beginning

2006-08-25 Thread Victor Ng

Hi gabor,

I've put up some patches to help with the unicode conversion of
django. We have a site which is shortly going to production where we
actually have to handle multiple unicode scripts including some which
have characters that do not fall into iso-8859-1.

Since I'm pretty lazy and I'm not really interested in maintaining my
own set of unicode patches against django forever - I'm *very*
interested in helping with any effort to get Django to support
unicode.

Adrian - can we get that branch opened up soon?

vic

On 8/21/06, gabor <[EMAIL PROTECTED]> wrote:
>
> Adrian Holovaty wrote:
> > On 8/8/06, gabor <[EMAIL PROTECTED]> wrote:
> >> i think unicodizing django can be done in 4 easily separated steps/parts:
> >>
> >> 1. request/response
> >> 2. templating-system
> >> 3. database-system
> >> 4. "overall unicode-conversion". this is mostly about replacing
> >> bytestrings with u"bla" in the code, and switching __str__ to __unicode__
> >>
> >> my biggest problem currently is, that i do not know how to continue...
> >> should i just write more and more patches to increase the
> >> "unicode-coverage" to more parts of django? or maybe a more coordinated
> >> approach would be better?
> >
> > Hey gabor,
> >
> > Sorry for the slow response on this -- I'm just now wading through a
> > couple of weeks' worth of django-users and django-developers messages.
> > This patch is a great step forward!
> >
> > Are you interested in a Subversion branch devoted to Unicoding Django?
> > Let me know...
> >
>
> (to make sure my original response is not caught up in a spam-filter or
> such, sending this to the list too)
>
>
> hi,
>
>
> yes, i'm interested :)
>
> cannot really promise how long it will take to convert the whole django
> to unicode, but will try. it's not hard. as i wrote, the changes are
> simple, it's just that many changes have to be done.
>
>
> thanks,
> gabor
>
> >
>


-- 
"Never attribute to malice that which can be adequately explained by
stupidity."  - Hanlon's Razor

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To post to this group, send email to django-developers@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/django-developers
-~--~~~~--~~--~--~---



Re: django unicode-conversion, beginning

2006-08-21 Thread gabor

Adrian Holovaty wrote:
> On 8/8/06, gabor <[EMAIL PROTECTED]> wrote:
>> i think unicodizing django can be done in 4 easily separated steps/parts:
>>
>> 1. request/response
>> 2. templating-system
>> 3. database-system
>> 4. "overall unicode-conversion". this is mostly about replacing
>> bytestrings with u"bla" in the code, and switching __str__ to __unicode__
>>
>> my biggest problem currently is, that i do not know how to continue...
>> should i just write more and more patches to increase the
>> "unicode-coverage" to more parts of django? or maybe a more coordinated
>> approach would be better?
> 
> Hey gabor,
> 
> Sorry for the slow response on this -- I'm just now wading through a
> couple of weeks' worth of django-users and django-developers messages.
> This patch is a great step forward!
> 
> Are you interested in a Subversion branch devoted to Unicoding Django?
> Let me know...
> 

(to make sure my original response is not caught up in a spam-filter or 
such, sending this to the list too)


hi,


yes, i'm interested :)

cannot really promise how long it will take to convert the whole django 
to unicode, but will try. it's not hard. as i wrote, the changes are 
simple, it's just that many changes have to be done.


thanks,
gabor




Re: Re: django unicode-conversion, beginning

2006-08-20 Thread James Bennett

On 8/20/06, Malcolm Tredinnick <[EMAIL PROTECTED]> wrote:
> Metaphorically cutting off both our arms so that we appear
> more aerodynamic is probably not a gain worth making.

That's going in my quotes file.

-- 
"May the forces of evil become confused on the way to your house."
  -- George Carlin




Re: django unicode-conversion, beginning

2006-08-20 Thread Ivan Sagalaev

Malcolm Tredinnick wrote:
> Metaphorically cutting off both our arms so that we appear
> more aerodynamic is probably not a gain worth making.

This is the explanation! :-)

>> 5. Internally, work with unicode strings exclusively (after  
>> transcoding the request and the template). Response should be  python  
>> unicode as well up until the moment it gets sent out.
> 
> That's the idea.

It really works like this already by accepting unicode and also StringIO 
buffers with unicode.




Re: django unicode-conversion, beginning

2006-08-20 Thread Malcolm Tredinnick

On Sun, 2006-08-20 at 07:15 +0200, Julian 'Julik' Tarkhanov wrote:
> 
> On 17-aug-2006, at 1:08, Bill de hÓra wrote:
> 
> > like wanting to serve utf8 rss feeds, but have latin1 come
> > in and out of mysql.
> 
> Might seem very extreme, but I would love to chime in. Maybe it would  
> be wise to go even further, whereby:
> 
> 1. Hardcode Django to output and input UTF-8 as the most useful for  
> interop

Huge -1.

This stuff (output encoding) has to be configurable, it's the way the
Internet works. Sure, there are a bunch of cases where the specs will be
inconclusive or ignored, and then we will need to make inspired choices,
just like every other data-consuming, network-based application. But the
whole planet has not standardised on UTF-8 and with valid reasons.

It's also not that hard to get right, albeit fairly fiddly. You identify
the interfaces between external data and Django and do the conversion to
unicode as soon as you can. That's the process Gabor is going through at
the moment. Metaphorically cutting off both our arms so that we appear
more aerodynamic is probably not a gain worth making.
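
A minimal sketch of the decode-early, encode-late pattern described above, written in present-day Python 3 (where `str` is the Unicode type) rather than the Python 2 of this thread. `OUTPUT_CHARSET`, `decode_input` and `encode_output` are hypothetical names chosen for illustration, not Django API.

```python
# Hypothetical setting; stands in for a configurable output encoding.
OUTPUT_CHARSET = "gb18030"


def decode_input(raw: bytes, charset: str) -> str:
    """Convert external bytes to text as soon as they enter the application."""
    return raw.decode(charset)


def encode_output(text: str, charset: str = OUTPUT_CHARSET) -> bytes:
    """Convert text back to bytes only at the moment it is sent out."""
    return text.encode(charset)


# Bytes arrive as UTF-8, are handled as text internally, and leave as GB18030.
incoming = "中文".encode("utf-8")
text = decode_input(incoming, "utf-8")
outgoing = encode_output(text)
```

Because the output charset is a parameter rather than a constant, a site that has to serve GB18030 and a site that serves UTF-8 can share the same internal code.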

> 1a. Any case where the developer might expect different input (for  
> instance almost all OPML files are still exported as ISO due to the
> idiosyncratic way Radio worked back in the day) has to be known to
> him and handled explicitly
> 1b. Honor the charset headers sent in the request for transcoding
> 1c. Allow everyone who wants to output other charsets to cry and perish.
> 2. Stick the utf-8 output charset anywhere where it's possible  
> (headers, page head...).

Since non-UTF-8 encodings are the norm in a lot of East-Asian locales
(both for cultural and technical reasons), this isn't going to work.

> 5. Internally, work with unicode strings exclusively (after  
> transcoding the request and the template). Response should be  python  
> unicode as well up until the moment it gets sent out.

That's the idea.

[...]
> I know, it seems so nice to be liberal and allow people to choose  
> their encoding but just too many situations prove that to be the  
> Wrong Choice.

The combined citizenry of China, Japan and South Korea thank you for
your input, but respectfully point out that you are mistaken.

Regards,
Malcolm





Re: django unicode-conversion, beginning

2006-08-19 Thread Julian 'Julik' Tarkhanov


On 17-aug-2006, at 1:08, Bill de hÓra wrote:

> like wanting to serve utf8 rss feeds, but have latin1 come
> in and out of mysql.

Might seem very extreme, but I would love to chime in. Maybe it would  
be wise to go even further, whereby:

1. Hardcode Django to output and input UTF-8 as the most useful for  
interop
1a. Any case where the developer might expect different input (for  
instance almost all OPML files are still exported as ISO due to the
idiosyncratic way Radio worked back in the day) has to be known to
him and handled explicitly
1b. Honor the charset headers sent in the request for transcoding
1c. Allow everyone who wants to output other charsets to cry and perish.
2. Stick the utf-8 output charset anywhere where it's possible  
(headers, page head...).
2. Allow the DB to be in another encoding for databases that support  
it. For instance, MySQL and PostgreSQL will transcode the strings for
the client on the fly, so you can do interop with them in UTF-8 even  
when they are in a different encoding.
3. Assume all templates are in UTF-8 as well because text editors  
have much more success dealing with them that way. Transcode
templates on read into unicode strings.
4. As a consequence of 1, let DEFAULT_CHARSET go. Too many choices  
really hurt here.
5. As a consequence of 1, deprecate the DATABASE_CHARSET I sent in as  
a patch and make it the default, so that all drivers switch their  
database clients to the most suitable Unicode form. SQLite has to be  
compiled with Unicode support, this has to be mentioned in the docs.
5. Internally, work with unicode strings exclusively (after  
transcoding the request and the template). Response should be  python  
unicode as well up until the moment it gets sent out.
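
Point 3 above (treat templates as UTF-8 and transcode them into unicode strings on read) might look like this in present-day Python 3. This is only a sketch: `load_template` is a hypothetical helper, not Django's template loader, and the temporary file exists just to make the example self-contained.

```python
import os
import tempfile


def load_template(path: str, charset: str = "utf-8") -> str:
    """Read a template file and return its contents as text."""
    with open(path, encoding=charset) as handle:
        return handle.read()


# Write a small UTF-8 template to a temporary file, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    template_path = os.path.join(tmp, "hello.html")
    with open(template_path, "wb") as handle:
        handle.write("<p>Gábor</p>".encode("utf-8"))
    loaded = load_template(template_path)
```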

Important to note is that every database driver has to be scrutinized  
for whether it returns unicode strings proper.

I know, it seems so nice to be liberal and allow people to choose  
their encoding but just too many situations prove that to be the  
Wrong Choice.

-- 
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl






Re: django unicode-conversion, beginning

2006-08-16 Thread Bjørn Stabell

In China GB18030 is required to be used by law, and most sites just
assume the browser uses that as the default, so they don't even specify
a character encoding.

Your likely setup for international web sites is to have Unicode in the
database (since databases have special support for it and it is a good
base encoding), but to serve up different encodings wherever UTF-8
proves problematic (for technical or legal reasons).

Hopefully, over time, there'll be less and less resistance to using
UTF-8.

Rgds,
Bjorn





Re: django unicode-conversion, beginning

2006-08-16 Thread Bill de hÓra

gabor wrote:
> 
> currently my plan is to have the following behaviour:
> 
> 1. i assume that every GET/POST param comes in encoded as 
> settings.DEFAULT_CHARSET, and will decode it accordingly. if it fails, 
> then it fails.

Assuming "you got served" with settings.DEFAULT_CHARSET, then sure.


> 3. will assume the database is in DEFAULT_CHARSET
>   - maybe can we somehow ask the db for its charset?

It would be a start.

 > so, what do you think?
 > or should we make it possible to have a system with mixed charsets?

I could imagine serving web content with one encoding, but lumping 
things in and out of the db with another. I guess people will need mixed 
encodings - like wanting to serve utf8 rss feeds, but have latin1 come 
in and out of mysql.  But so long as we sweep out bytestrings inside 
django for unicode objects, mixed i/o should be possible to add on later.
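
The mixed case mentioned above (latin1 in the database, UTF-8 on the wire) could be sketched as follows in present-day Python 3. `DB_CHARSET` and `FEED_CHARSET` are hypothetical settings, and the byte string merely stands in for a value returned by a database driver.

```python
DB_CHARSET = "latin-1"    # hypothetical: encoding the database hands back
FEED_CHARSET = "utf-8"    # hypothetical: encoding used for the served feed


def from_db(raw: bytes) -> str:
    """Decode a value from the database into text."""
    return raw.decode(DB_CHARSET)


def to_feed(text: str) -> bytes:
    """Encode text for the outgoing feed."""
    return text.encode(FEED_CHARSET)


row = b"Bill de h\xd3ra"          # latin1 bytes, as if read from MySQL
feed_bytes = to_feed(from_db(row))
```

As long as everything between `from_db` and `to_feed` handles text only, the two charsets never need to know about each other.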

Would being able to spec the db char encoding via settings.py be a 
needed option, or is that even possible across databases?

cheers
Bill





Re: django unicode-conversion, beginning

2006-08-16 Thread Jeremy Dunck

On 8/16/06, gabor <[EMAIL PROTECTED]> wrote:
> 3. will assume the database is in DEFAULT_CHARSET
> - maybe can we somehow ask the db for its charset?

I think you really have to allow for different charset in the DB--
legacy integration, remember.




Re: django unicode-conversion, beginning

2006-08-16 Thread gabor

Jeremy Dunck wrote:
> On 8/16/06, Bill de hÓra <[EMAIL PROTECTED]> wrote:
>> Now. Most (all?) browser UAs sniff the content to second guess the media
>> type. They don't much pay attention to Content-Type (I think maybe IE
>> ignores it altogether). The problem for this example is they might be
>> doing something similar for character encodings declared on the form
>> page's GET request. Browsers do this because so much served content is
>> mislabelled (eg feeds served as text/html and video as text/plain).
> 
> IE doesn't totally ignore it.  It just does some horrible, wrong things
> while considering it.
> http://blogs.msdn.com/ie/archive/2005/02/01/364581.aspx
> http://msdn.microsoft.com/workshop/networking/moniker/overview/appendix_a.asp
> 
> Ian Hickson says contenttype is dead:
> http://ln.hixie.ch/?start=1144794177=1
> http://ln.hixie.ch/?start=1154950069=1
> 

hmmm.. sad to hear that.. but it hopefully does not affect the 
django-unicode issue too much...

currently my plan is to have the following behaviour:

1. i assume that every GET/POST param comes in encoded as 
settings.DEFAULT_CHARSET, and will decode it accordingly. if it fails, 
then it fails.
- might make an exception and in case of post-data check the 
content-type header of the request, whether it contains any charset stuff
-if you really-really-really need to do some crazy 
is-sent-as-foo-but-has-to-be-treated-as-bar, you can always use the 
raw-postdata and raw-getdata.

2. will render the template in DEFAULT_CHARSET

3. will assume the database is in DEFAULT_CHARSET
- maybe can we somehow ask the db for its charset?

so, what do you think?
or should we make it possible to have a system with mixed charsets? 
(well, maybe having a different DB_CHARSET and a DEFAULT_CHARSET could 
work. maybe)
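
A rough sketch of step 1 of this plan in present-day Python 3, using only the standard library: take the charset from the request's Content-Type header when one is given, otherwise fall back to the default, and let decoding fail loudly. `DEFAULT_CHARSET` here is a plain module constant standing in for settings.DEFAULT_CHARSET, and `request_charset` and `decode_form` are hypothetical helpers, not Django functions.

```python
from email.message import Message
from urllib.parse import parse_qs

DEFAULT_CHARSET = "utf-8"  # stands in for settings.DEFAULT_CHARSET


def request_charset(content_type: str, default: str = DEFAULT_CHARSET) -> str:
    """Return the charset parameter of a Content-Type value, or the default."""
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset(failobj=default)


def decode_form(body: bytes, content_type: str) -> dict:
    """Decode urlencoded form data; undecodable input raises UnicodeDecodeError."""
    charset = request_charset(content_type)
    # The percent-encoded body itself is plain ASCII.
    return parse_qs(body.decode("ascii"), encoding=charset, errors="strict")
```

With `errors="strict"`, a latin1-encoded value submitted to a UTF-8 site raises instead of being silently mangled, which matches the "if it fails, then it fails" rule.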

gabor




Re: django unicode-conversion, beginning

2006-08-16 Thread Jeremy Dunck

On 8/9/06, gabor <[EMAIL PROTECTED]> wrote:
> hmmm.. are you sure that the situation with unicode-aware editors is so bad?
>
> could you name some non-unicode-aware editors?
> for me it seems that from notepad through vim to eclipse everything does
> unicode fine...

On Windows, I used UltraEdit, which is a very popular editor.  $25ish
with very nice features.
It claims to support unicode, but I've tested with it and it horribly
mangles anything but UTF-8.  Worse, you can open a UTF-8 file as
though it were ASCII, then save as unicode, causing double-encoding.
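
The double-encoding failure described above can be reproduced in a few lines of present-day Python 3. This is an illustration only: latin-1 stands in for whatever single-byte codepage the editor actually assumed, which is an assumption on my part.

```python
original = "Gábor"

# A UTF-8 file on disk.
utf8_bytes = original.encode("utf-8")

# The editor misreads those bytes as a single-byte encoding (latin-1 here)...
misread = utf8_bytes.decode("latin-1")

# ...and then saves the misread text as UTF-8 again: double encoding.
double_encoded = misread.encode("utf-8")

# The damage is reversible only if you know exactly which wrong codec was used.
repaired = double_encoded.decode("utf-8").encode("latin-1").decode("utf-8")
```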

I hereby decree that all strings in computing should have a charset
associated with them.

...

Damn, it didn't work.




Re: django unicode-conversion, beginning

2006-08-16 Thread Jeremy Dunck

On 8/16/06, Bill de hÓra <[EMAIL PROTECTED]> wrote:
> Now. Most (all?) browser UAs sniff the content to second guess the media
> type. They don't much pay attention to Content-Type (I think maybe IE
> ignores it altogether). The problem for this example is they might be
> doing something similar for character encodings declared on the form
> page's GET request. Browsers do this because so much served content is
> mislabelled (eg feeds served as text/html and video as text/plain).

IE doesn't totally ignore it.  It just does some horrible, wrong things
while considering it.
http://blogs.msdn.com/ie/archive/2005/02/01/364581.aspx
http://msdn.microsoft.com/workshop/networking/moniker/overview/appendix_a.asp

Ian Hickson says contenttype is dead:
http://ln.hixie.ch/?start=1144794177=1
http://ln.hixie.ch/?start=1154950069=1

Happily, Mark Pilgrim did a lot of the hard work by converting
Mozilla's charset detection routines to Python in support of his feed
parser.
http://chardet.feedparser.org/docs/how-it-works.html




Re: django unicode-conversion, beginning

2006-08-16 Thread Bill de hÓra

Gábor Farkas wrote:

> for example, using this html file:
> 
> http://localhost:7000;>
> 
> 
> 
> (+ additional xhtml-headers, http-equiv-content-type=utf-8 etc)
> 
> firefox submits this:
> 
> 
> POST / HTTP/1.1
> Host: localhost:7000
> User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1b1) 
> Gecko/20060601 BonEcho/2.0b1 (Ubuntu-edgy)
> Accept: 
> text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
> Accept-Language: en-us,en;q=0.5
> Accept-Encoding: gzip,deflate
> Accept-Charset: UTF-8,*
> Keep-Alive: 300
> Connection: keep-alive
> Cookie: sessionid=9f5f5a5c387a07dd6b7e4d34a04e38b9
> Content-Type: application/x-www-form-urlencoded
> Content-Length: 14
> 
> gabor1=farkas1
> =
> 
> so, in what charset is the POSTDATA?

I don't have good news for you.

If we are talking about HTML forms in this case - undefined. There's no 
charset attribute defined on the form. In that case the value is assumed 
to be "unknown" and clients can (not must) map this value as the 
character encoding that was used to send the html form.

You can't assume ISO-8859-1 for a form, as that only applies as a 
default for text/* types.


> so, i agree with you, that if they do send it, we should honor it. but 
> they are not sending it (i assume they should send it in the 
> Content-Type header).

To spec? Then client UAs /must/ treat the Content-Type header as the 
authoritative declaration of the character encoding if there is a 
charset - it overrides *everything*. HTTP 1.1 and recent W3C findings 
are explicit on this.

Now. Most (all?) browser UAs sniff the content to second guess the media 
type. They don't much pay attention to Content-Type (I think maybe IE 
ignores it altogether). The problem for this example is they might be 
doing something similar for character encodings declared on the form 
page's GET request. Browsers do this because so much served content is 
mislabelled (eg feeds served as text/html and video as text/plain).

So the heuristic "browsers send content back in the encoding they 
receive it" can be assumed, but you have to allow for cases where 
they are sniffing content and ignoring server directives. But, as a 
server implementor, my advice is to *always* send the Content-Type 
header and charset, and assume the data will be returned in that encoding.

In order to be as stateless as possible, that means serving all forms in 
the same encoding, and typically your best bet in that case is to serve 
as UTF-8. Serving latin1 might work also for cases where people are 
using keyboard shortcuts for things like my surname (I'd need to test this 
to be sure; all I can say after 10 years of shopping online is that it's 
been pot luck). For cut and pasted content from word, we'd need to 
transcode down from cp1252 to latin1.
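
One way to "transcode down" from cp1252 to latin1, sketched in present-day Python 3: swap the typographic characters that Word produces for rough ASCII stand-ins, then encode. The mapping table and the name `to_latin1` are hypothetical and deliberately incomplete; anything else that latin1 cannot represent becomes `?`.

```python
# Characters that cp1252 has but latin1 lacks, mapped to rough ASCII stand-ins.
# This table is illustrative and far from complete.
CP1252_TO_PLAIN = str.maketrans({
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2026": "...",  # horizontal ellipsis
})


def to_latin1(text: str) -> bytes:
    """Replace common Word punctuation, then encode as latin1."""
    return text.translate(CP1252_TO_PLAIN).encode("latin-1", errors="replace")
```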

cheers
Bill





Re: django unicode-conversion, beginning

2006-08-16 Thread Gábor Farkas

Bill de hÓra wrote:
> gabor wrote:
> 
>> so what do you think about the following approach:
>>
>> try ascii-decoding
>> if fails, try utf8-decoding
>> if fails do iso-8859-1-decoding (this cannot fail).
>>
>> ?
> 
> Dumb question maybe. How do you know this encoding ladder will work?

it depends on how you define 'will work' :-)

it will not fail (every string can be decoded as iso-8859-1).
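
The ladder quoted above, sketched in present-day Python 3 (`decode_with_ladder` is a hypothetical name). It shows both halves of the answer: the function never raises, but it can still return the wrong text.

```python
def decode_with_ladder(raw: bytes) -> str:
    """Try ASCII, then UTF-8, then fall back to ISO-8859-1."""
    for charset in ("ascii", "utf-8"):
        try:
            return raw.decode(charset)
        except UnicodeDecodeError:
            continue
    # Every byte value maps to a code point in ISO-8859-1, so this cannot raise.
    return raw.decode("iso-8859-1")


# KOI8-R bytes are not valid UTF-8, so they fall through to ISO-8859-1
# and come out as unrelated Latin characters rather than Cyrillic.
koi8_result = decode_with_ladder("Иван".encode("koi8-r"))
```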

> 
>> but imho this should happen only in "special" cases like 
>> environ-variables.. for example in get/post params i would prefer to 
>> raise an exception when the data cannot be en/de-coded using the 
>> configured charset.
> 
> You'd need to honor charset parameters sent out of Django apps and sent 
> back by the client. A sensible default encoding to emit is UTF-8.

i would honor them if they would be sent :-)

for example, using this html file:

http://localhost:7000;>



(+ additional xhtml-headers, http-equiv-content-type=utf-8 etc)

firefox submits this:


POST / HTTP/1.1
Host: localhost:7000
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1b1) 
Gecko/20060601 BonEcho/2.0b1 (Ubuntu-edgy)
Accept: 
text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: UTF-8,*
Keep-Alive: 300
Connection: keep-alive
Cookie: sessionid=9f5f5a5c387a07dd6b7e4d34a04e38b9
Content-Type: application/x-www-form-urlencoded
Content-Length: 14

gabor1=farkas1
=

so, in what charset is the POSTDATA?


so, i agree with you, that if they do send it, we should honor it. but 
they are not sending it (i assume they should send it in the 
Content-Type header).

the only usable assumption i have found up to now is that the browsers 
sends the data back encoded in the submitting-html-page's charset.

or is there a better way?

gabor




Re: django unicode-conversion, beginning

2006-08-15 Thread Bill de hÓra

Malcolm Tredinnick wrote:
> On Wed, 2006-08-09 at 21:51 +0200, gabor wrote:
> [...]
>> phew... the immortal 
>> how-tolerant-we-should-be-when-doing-unicode-conversion problems :-)
> 
> Agreed. This is much easier on my side of the fence (lobbing problems),
> than your side (solving them).
> [...]
> All that being said, you could start off implementing your list and go
> from there (although surely utf-8 decoding will also handle ASCII
> strings, so you could skip the first step).

These would be good rules to follow:

- use unicode objects internally, weed out encoded bytestrings.

- decode all loaded files and configuration into unicode; templates will 
be challenging.

- initially at least, add assertions enforcing the use of unicode 
parameters (crash when you see a bytestring being passed into unicode 
aware code or across applications)

- default encode to utf8 at server boundaries, modulo what Malcolm said 
about honoring charsets served out.

- default de/encode in and out of utf8 for storage inside databases; it 
might not be possible and it might require a declaration in settings.

- have the admin app strip out cp1252 to deal with cut and paste from 
windows; effbot has a dictionary that can be used for this.
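
The third rule above (crash when a bytestring is passed into unicode-aware code) could be enforced with a small guard like this, sketched in present-day Python 3. `require_text` and `greeting` are hypothetical names, and raising `TypeError` instead of using a bare `assert` is my own choice so the check survives `python -O`.

```python
def require_text(value, name="value"):
    """Fail loudly if an encoded byte string reaches text-only code."""
    if isinstance(value, (bytes, bytearray)):
        raise TypeError(f"{name} must be text, got bytes: {value!r}")
    return value


def greeting(name):
    # Check at the boundary of text-only code, then work with text freely.
    name = require_text(name, "name")
    return "Hello, " + name
```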

cheers
Bill








Re: django unicode-conversion, beginning

2006-08-12 Thread Adrian Holovaty

On 8/8/06, gabor <[EMAIL PROTECTED]> wrote:
> i think unicodizing django can be done in 4 easily separated steps/parts:
>
> 1. request/response
> 2. templating-system
> 3. database-system
> 4. "overall unicode-conversion". this is mostly about replacing
> bytestrings with u"bla" in the code, and switching __str__ to __unicode__
>
> my biggest problem currently is, that i do not know how to continue...
> should i just write more and more patches to increase the
> "unicode-coverage" to more parts of django? or maybe a more coordinated
> approach would be better?

Hey gabor,

Sorry for the slow response on this -- I'm just now wading through a
couple of weeks' worth of django-users and django-developers messages.
This patch is a great step forward!

Are you interested in a Subversion branch devoted to Unicoding Django?
Let me know...

Adrian

-- 
Adrian Holovaty
holovaty.com | djangoproject.com




Re: django unicode-conversion, beginning

2006-08-10 Thread limodou

On 8/10/06, Ivan Sagalaev <[EMAIL PROTECTED]> wrote:
>
> Malcolm Tredinnick wrote:
> > I completely agree this is painful and normally I would punt. But my
> > crystal ball tells me that you will then get bug reports from Mr
> > Sagalaev, who is generally both very diligent in his debugging and likes
> > to use some language with a funny alphabet. If whatever you come up with
> > works naturally in places like Ivan's setup and maybe somebody who lives
> > in Hong Kong or Japan or some other East Asian locale, you could
> > consider this "solved" to some extent.
>
> I'm afraid I'm not a very good tester with this exact problem. Python on
> my Ubuntu happily says 'UTF-8' when asked
> 'locale.getpreferredencoding()'. But indeed I can always try these
> things with my compatriots using Windows or configuring their linuxes
> with old single-byte 'KOI8-R'.
>
> In fact I was under the impression that a string returned from this function
> can be safely used for decoding. For example on Russian Windows it
> returns 'cp1251' which works perfectly well while not being the standard
> IANA name, which is 'windows-1251' and works well also.
>
> So maybe we can just rely on Python's smart little brain and do
> something like this:
>
In python Lib/encodings/aliases.py, you would find the encoding name
mapping table.
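
The alias table can also be queried through the standard `codecs` module instead of being read by hand. A small sketch in present-day Python 3; `canonical_codec_name` is a hypothetical helper name.

```python
import codecs


def canonical_codec_name(name: str) -> str:
    """Return the name Python's codec registry uses for an encoding label."""
    return codecs.lookup(name).name


# Both labels for the Russian Windows code page resolve to the same codec.
same_codec = canonical_codec_name("cp1251") == canonical_codec_name("windows-1251")
```

A label that Python does not know makes `codecs.lookup` raise `LookupError`, so the same call doubles as a validity check.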

-- 
I like python!
My Blog: http://www.donews.net/limodou
My Django Site: http://www.djangocn.org
NewEdit Maillist: http://groups.google.com/group/NewEdit




Re: django unicode-conversion, beginning

2006-08-10 Thread Ivan Sagalaev

gabor wrote:
> hmmm.. are you sure that the situation with unicode-aware editors is so bad?
> 
> could you name some non-unicode-aware editors?
> for me it seems that from notepad through vim to eclipse everything does 
> unicode fine...

Ok, I should rephrase it. Even if most editors do support utf-8 they 
aren't configured to do so by default. Unfortunately there is some 
notion that unicode is something "new" and "scary" and "who knows what 
problems it will cause". So on systems where utf-8 is not the default 
environment setting (meaning all Windows and many Linuxes), if a 
programmer starts his favorite text editor, odds are that it will not 
save a new file in utf-8.

But to be sure I'd better run a poll on my forum about it...




Re: django unicode-conversion, beginning

2006-08-10 Thread Ivan Sagalaev

Malcolm Tredinnick wrote:
> I completely agree this is painful and normally I would punt. But my
> crystal ball tells me that you will then get bug reports from Mr
> Sagalaev, who is generally both very diligent in his debugging and likes
> to use some language with a funny alphabet. If whatever you come up with
> works naturally in places like Ivan's setup and maybe somebody who lives
> in Hong Kong or Japan or some other East Asian locale, you could
> consider this "solved" to some extent.

I'm afraid I'm not very good tester with this exact problem. Python on 
my Ubuntu happily says 'UTF-8' when asked 
'locale.getpreferredencoding()'. But indeed I can always try these 
things with my compatriots using Windows or configuring their linuxes 
with old single-byte 'KOI8-R'.

In fact I was under impression that a string returned from this function 
can be safely used for decoding. For example on Russian Windows it 
returns 'cp1251' which works perfectly well while not being a standard 
ISO name which is 'windows-1251' and works well also.

So maybe we can just rely on Python's smart little brain and do
something like this:

- try decoding from locale.getpreferredencoding()
- failing that try something safe like iso-8859-1
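
The two steps above could be sketched roughly as follows. This is only an
illustration, written for current Python where byte strings are `bytes`
and `raw.decode(...)` plays the role of `unicode(raw, ...)`; the function
name `decode_with_locale` and its `fallback` parameter are assumptions, not
anything from the patch.

```python
import locale


def decode_with_locale(raw, fallback="iso-8859-1"):
    """Decode bytes with the locale's preferred encoding, else a fallback."""
    preferred = locale.getpreferredencoding(False)
    try:
        return raw.decode(preferred)
    except (UnicodeDecodeError, LookupError):
        # LookupError covers a locale name Python has no codec for.
        # iso-8859-1 maps every byte to a character, so this cannot fail.
        return raw.decode(fallback)
```

Note that the fallback never raises, so a wrong guess shows up as garbled
text rather than as an error.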




Re: django unicode-conversion, beginning

2006-08-09 Thread Malcolm Tredinnick

On Wed, 2006-08-09 at 21:51 +0200, gabor wrote:
[...]
> phew... the immortal 
> how-tolerant-we-should-be-when-doing-unicode-conversion problems :-)

Agreed. This is much easier on my side of the fence (lobbing problems),
than your side (solving them).

> i generally prefer to do as little guesswork as possible, but in the 
> case of the environ-variables it seems we cannot avoid it.. after all, 
> it cannot crash when parsing the environ variables, because there's no 
> way from the programmer's side to affect them.
> 
> so what do you think about the following approach:
> 
> try ascii-decoding
> if fails, try utf8-decoding
> if fails do iso-8859-1-decoding (this cannot fail).

I was thinking you could use the locale module to help you somewhat:
locale.getdefaultlocale() and locale.getpreferredencoding() might both
be useful, although experimentation is needed. For example, on my
(Linux) system, getdefaultlocale() returns ('en_AU', 'utf') and I'm
pretty sure 'utf' isn't an encoding (utf-8 is, utf-16 also, but not
plain old utf.. :-( ).
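
Rather than assuming which names are valid, the code can ask Python
whether it has a codec for whatever the locale module reports. A minimal
sketch, assuming a hypothetical helper `usable_encoding` and a UTF-8
default; whether a short name such as 'utf' is accepted depends on the
alias table of the Python version in use, which is exactly why the check
is done at run time.

```python
import codecs
import locale


def usable_encoding(candidate, default="utf-8"):
    """Return candidate if Python has a codec for it, otherwise default."""
    try:
        codecs.lookup(candidate)
    except (LookupError, TypeError):
        # LookupError: unknown name; TypeError: candidate was None.
        return default
    return candidate


encoding = usable_encoding(locale.getpreferredencoding(False))
```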

I completely agree this is painful and normally I would punt. But my
crystal ball tells me that you will then get bug reports from Mr
Sagalaev, who is generally both very diligent in his debugging and likes
to use some language with a funny alphabet. If whatever you come up with
works naturally in places like Ivan's setup and maybe somebody who lives
in Hong Kong or Japan or some other East Asian locale, you could
consider this "solved" to some extent.

All that being said, you could start off implementing your list and go
from there (although surely utf-8 decoding will also handle ASCII
strings, so you could skip the first step).
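
To illustrate why the ASCII step is redundant: any pure-ASCII byte string
decodes to the same text under UTF-8, so the list shrinks to two steps. A
sketch under that assumption (the name `decode_environ_value` is
hypothetical, and `bytes` stands in for the byte strings of the time):

```python
def decode_environ_value(raw):
    """Decode a raw environment value: UTF-8 first, then ISO-8859-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Every byte value is valid ISO-8859-1, so this step always succeeds.
        return raw.decode("iso-8859-1")


# ASCII input gives the same result under both codecs.
ascii_bytes = b"/usr/local/bin"
ascii_matches = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")
```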

> but imho this should happen only in "special" cases like 
> environ-variables.. for example in get/post params i would prefer to 
> raise an exception when the data cannot be en/de-coded using the 
> configured charset.

*Providing* what we send in the headers is that restrictive. A server
can send what character set encodings it will accept in the header. The
client can pick any one of those to send back. So keep that on your list
of things to check (this is HTTP-level stuff).
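
One place such a check could live is where the request's Content-Type
header is read: if the client declares a charset there, use it instead of
assuming the configured one. This is only a sketch; the helper name
`declared_charset` is made up, many clients send no charset parameter at
all, and parsing the header with the standard `email.message` module is
just one convenient option.

```python
from email.message import Message


def declared_charset(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or default."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lower-cases the value it finds.
    return msg.get_content_charset(default)
```

For example, `declared_charset("application/x-www-form-urlencoded;
charset=KOI8-R")` yields "koi8-r", while a header without a charset
parameter falls back to the default.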

Regards,
Malcolm





Re: django unicode-conversion, beginning

2006-08-09 Thread gabor

Malcolm Tredinnick wrote:
> A couple of comments on the patch itself. I realise it's only a proof of
> concept at the moment, so take these as more things to think about when you
> want to tidy it up:
> 
> (1) A docstring like """needed to workaround the cgi.parse_qsl
> unicode-problem""" is not very future-proof. *What* parse_qsl unicode
> problem? How will we know if/when it goes away? Either a quick
> description of the problem or a URL if it's tricky and explained
> elsewhere will help people who need to read this code in six months'
> time.

ok

> 
> (2) You can't necessarily assume the environment is always in ASCII (or
> maybe you can; see below). For example, my current locale is set to
> en_AU.UTF-8 and I can do
> 
> export foo="€50,00"
> 
> If I'm not careful when parsing os.environ['foo'] this comes out as
> rubbish (I need to do unicode(os.environ['foo'], 'utf-8') or similar).
> 
> Probably some playing around with the locale module to work out the
> right behaviour and getting a few people to test things (e.g. Windows
> vs. Linux vs. Macs, etc) will be necessary. It's also important not to
> go too overboard here, but since arbitrary environment variables can be
> set through Apache, we need to be able to work with that to be
> "correct". Hmm ... what are the restrictions on what webservers can put
> in their config files? Maybe ASCII-only is reasonable. *shrug*
> 

phew... the immortal 
how-tolerant-we-should-be-when-doing-unicode-conversion problems :-)

i generally prefer to do as little guesswork as possible, but in the 
case of the environ-variables it seems we cannot avoid it.. after all, 
it cannot crash when parsing the environ variables, because there's no 
way from the programmer's side to affect them.

so what do you think about the following approach:

try ascii-decoding
if fails, try utf8-decoding
if fails do iso-8859-1-decoding (this cannot fail).

?


but imho this should happen only in "special" cases like 
environ-variables.. for example in get/post params i would prefer to 
raise an exception when the data cannot be en/de-coded using the 
configured charset.
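
The strict behaviour for get/post params might look something like this.
It is a sketch only: `RequestDecodeError` and `decode_param` are
hypothetical names, the `charset` argument stands in for the configured
charset, and the code uses `bytes` as current Python does.

```python
class RequestDecodeError(ValueError):
    """Raised when request data does not match the configured charset."""


def decode_param(raw, charset="utf-8"):
    """Decode one request parameter strictly; never guess another charset."""
    try:
        return raw.decode(charset)
    except UnicodeDecodeError as exc:
        raise RequestDecodeError(
            "parameter is not valid %s: %r" % (charset, raw)
        ) from exc
```

Unlike the lenient handling of environ-variables, bad input here surfaces
as an exception that the caller has to deal with.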

> Maybe more investigation needed here.
> 
> (3) I know there are some software projects apparently using unicodize
> as a word, but ... *shudder*. Using "code" as an analogy, "unicodify"
> would be nicer (nobody uses "codize", I would hope).
> 

ok

> (4) As you go through this process, keep a list somewhere of what people
> need to do to port existing applications across to using this
> functionality. Ideally, the answer would be "not much" and we can cast
> from the default encoding to unicode internally where necessary. But I'm
> sure there will be some changes required, so keeping a list of things to
> watch out for as you go will help people test this for you.
> 

will try.

gabor




Re: django unicode-conversion, beginning

2006-08-09 Thread Ivan Sagalaev

First of all, Gabor, thank you very much for doing this!

gabor wrote:
> today i experimented a little with the django source code,
> and here are the results.
> 
> if you apply a very small patch (65lines, attached), you can write a view
> completely in unicode.
> means:
> - GET/POST contains unicode data
> - request.META contains unicode data
> - you can put unicode text into the HttpResponse (this was already possible
> without the patch)

Here's a problem that I didn't know how to solve last time this topic
was discussed.

You can put unicode in HttpResponse. Does that imply that template
processing should be done in unicode too? I mean, should context data
be in unicode? This would be convenient later because we will get all
the data from the DB in unicode as well. But it raises the question of
the encoding of the actual template files.

We need to know the encoding of a template file. This can be done either
by mandating that templates be in settings.DEFAULT_CHARSET or by
creating a new setting (TEMPLATE_CHARSET). The reason for having two
different settings is that enforcing UTF-8 by default in templates
means forcing people to use unicode-aware text editors, which are not
that common.
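
As a rough illustration of the second option, a template loader could
decode the file with a dedicated setting. TEMPLATE_CHARSET here is the
hypothetical setting proposed above, not an existing Django option, and
`load_template_source` is a made-up helper rather than Django's loader.

```python
import io

# Hypothetical setting from the proposal above; not an existing option.
TEMPLATE_CHARSET = "utf-8"


def load_template_source(path, charset=None):
    """Read a template file and return its contents as unicode text."""
    with io.open(path, encoding=charset or TEMPLATE_CHARSET) as handle:
        return handle.read()
```

A site whose templates are saved in, say, cp1251 would set the charset
accordingly while still sending responses in DEFAULT_CHARSET.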





Re: django unicode-conversion, beginning

2006-08-09 Thread Aidas Bendoraitis

Shouldn't the UTF-8 encoding also be declared in all files, as described
here: http://www.python.org/dev/peps/pep-0263/ ?

That is using

#!/usr/bin/python
# -*- coding: UTF-8 -*-

at the beginning of python code files.

This works pretty well, at least when you need to create new instances
of models containing multilingual characters via a python script file.
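
For what it's worth, the declaration can be checked programmatically. The
sketch below uses the standard `tokenize.detect_encoding` function from
current Python to read the coding line from a small in-memory source; the
sample source itself is invented for illustration.

```python
import io
import tokenize

# A tiny source file: shebang, PEP 263 declaration, then non-ASCII content.
source = (
    b"#!/usr/bin/python\n"
    b"# -*- coding: UTF-8 -*-\n"
    b"name = 'caf\xc3\xa9'\n"
)

# detect_encoding() inspects at most the first two lines.
encoding, first_lines = tokenize.detect_encoding(io.BytesIO(source).readline)
decoded = source.decode(encoding)
```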


Regards,
Aidas Bendoraitis [aka Archatas]


On 8/9/06, Malcolm Tredinnick <[EMAIL PROTECTED]> wrote:
>
> Hey Gabor,
>
> On Wed, 2006-08-09 at 01:03 +0200, gabor wrote:
> > today i experimented a little with the django source code,
> > and here are the results.
> >
> > if you apply a very small patch (65lines, attached), you can write a view
> > completely in unicode.
> > means:
> > - GET/POST contains unicode data
> > - request.META contains unicode data
> > - you can put unicode text into the HttpResponse (this was already possible
> > without the patch)
> >
> > of course, this patch is a demonstration only. the charset is hardcoded
> > to UTF-8 (should be settings.DEFAULT_CHARSET), and it only handles the
> > WSGI way (the mod_python one is not handled). also templating and ORM
> > are not touched. (not to mention the ugliness of the code)
> >
> > but still, i was quite surprised that with such small changes so much
> > can be done.
>
> The low-hanging fruit are definitely the place to start for this sort of
> thing.
>
> >
> > i think unicodizing django can be done in 4 easily separated steps/parts:
> >
> > 1. request/response
> > 2. templating-system
> > 3. database-system
> > 4. "overall unicode-conversion". this is mostly about replacing
> > bytestrings with u"bla" in the code, and switching __str__ to __unicode__
> >
> > my biggest problem currently is, that i do not know how to continue...
> > should i just write more and more patches to increase the
> > "unicode-coverage" to more parts of django? or maybe a more coordinated
> > approach would be better?
>
> Ultimately, getting you a svn branch to work in will probably be
> easiest. Maintaining a bunch of separate patches against a rapidly
> changing tree can be fairly time consuming. I'm not sure what the
> procedure is for that. Adrian?
>
> Keeping the changes as reasonably independent as possible is a great
> idea as far as you can take it. It will make review and testing a lot
> easier, as well as keeping you saner because you will only have to be
> looking at one layer at a time.
>
> A couple of comments on the patch itself. I realise it's only a proof of
> concept at the moment, so take these as more things to think about when you
> want to tidy it up:
>
> (1) A docstring like """needed to workaround the cgi.parse_qsl
> unicode-problem""" is not very future-proof. *What* parse_qsl unicode
> problem? How will we know if/when it goes away? Either a quick
> description of the problem or a URL if it's tricky and explained
> elsewhere will help people who need to read this code in six months'
> time.
>
> (2) You can't necessarily assume the environment is always in ASCII (or
> maybe you can; see below). For example, my current locale is set to
> en_AU.UTF-8 and I can do
>
> export foo="€50,00"
>
> If I'm not careful when parsing os.environ['foo'] this comes out as
> rubbish (I need to do unicode(os.environ['foo'], 'utf-8') or similar).
>
> Probably some playing around with the locale module to work out the
> right behaviour and getting a few people to test things (e.g. Windows
> vs. Linux vs. Macs, etc) will be necessary. It's also important not to
> go too overboard here, but since arbitrary environment variables can be
> set through Apache, we need to be able to work with that to be
> "correct". Hmm ... what are the restrictions on what webservers can put
> in their config files? Maybe ASCII-only is reasonable. *shrug*
>
> Maybe more investigation needed here.
>
> (3) I know there are some software projects apparently using unicodize
> as a word, but ... *shudder*. Using "code" as an analogy, "unicodify"
> would be nicer (nobody uses "codize", I would hope).
>
> (4) As you go through this process, keep a list somewhere of what people
> need to do to port existing applications across to using this
> functionality. Ideally, the answer would be "not much" and we can cast
> from the default encoding to unicode internally where necessary. But I'm
> sure there will be some changes required, so keeping a list of things to
> watch out for as you go will help people test this for you.
>
> Good to see somebody working on this. :-)
>
> Regards,
> Malcolm
>
>
>
> >
>




Re: django unicode-conversion, beginning

2006-08-08 Thread Malcolm Tredinnick

Hey Gabor,

On Wed, 2006-08-09 at 01:03 +0200, gabor wrote:
> today i experimented a little with the django source code,
> and here are the results.
> 
> if you apply a very small patch (65lines, attached), you can write a view
> completely in unicode.
> means:
> - GET/POST contains unicode data
> - request.META contains unicode data
> - you can put unicode text into the HttpResponse (this was already possible
> without the patch)
> 
> of course, this patch is a demonstration only. the charset is hardcoded
> to UTF-8 (should be settings.DEFAULT_CHARSET), and it only handles the
> WSGI way (the mod_python one is not handled). also templating and ORM
> are not touched. (not to mention the ugliness of the code)
> 
> but still, i was quite surprised that with such small changes so much
> can be done.

The low-hanging fruit are definitely the place to start for this sort of
thing.

> 
> i think unicodizing django can be done in 4 easily separated steps/parts:
> 
> 1. request/response
> 2. templating-system
> 3. database-system
> 4. "overall unicode-conversion". this is mostly about replacing
> bytestrings with u"bla" in the code, and switching __str__ to __unicode__
> 
> my biggest problem currently is, that i do not know how to continue...
> should i just write more and more patches to increase the
> "unicode-coverage" to more parts of django? or maybe a more coordinated
> approach would be better?

Ultimately, getting you a svn branch to work in will probably be
easiest. Maintaining a bunch of separate patches against a rapidly
changing tree can be fairly time consuming. I'm not sure what the
procedure is for that. Adrian?

Keeping the changes as reasonably independent as possible is a great
idea as far as you can take it. It will make review and testing a lot
easier, as well as keeping you saner because you will only have to be
looking at one layer at a time.

A couple of comments on the patch itself. I realise it's only a proof of
concept at the moment, so take these as more things to think about when you
want to tidy it up:

(1) A docstring like """needed to workaround the cgi.parse_qsl
unicode-problem""" is not very future-proof. *What* parse_qsl unicode
problem? How will we know if/when it goes away? Either a quick
description of the problem or a URL if it's tricky and explained
elsewhere will help people who need to read this code in six months'
time.

(2) You can't necessarily assume the environment is always in ASCII (or
maybe you can; see below). For example, my current locale is set to
en_AU.UTF-8 and I can do

export foo="€50,00"

If I'm not careful when parsing os.environ['foo'] this comes out as
rubbish (I need to do unicode(os.environ['foo'], 'utf-8') or similar).
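
To make the "rubbish" concrete, here is a small sketch that simulates the
bytes a UTF-8 shell would export for that value and decodes them two ways.
It uses `bytes.decode` from current Python in place of `unicode(...)`, and
the variable names are for illustration only.

```python
# The bytes a UTF-8 environment would hold for the value above
# ("\u20ac" is the euro sign).
raw = "\u20ac50,00".encode("utf-8")

# Decoding with the wrong charset silently yields three junk characters
# in place of the single euro sign.
wrong = raw.decode("iso-8859-1")

# Decoding with the charset the environment actually used.
right = raw.decode("utf-8")
```

The wrong decoding raises no error, which is what makes this kind of
mistake easy to miss.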

Probably some playing around with the locale module to work out the
right behaviour and getting a few people to test things (e.g. Windows
vs. Linux vs. Macs, etc) will be necessary. It's also important not to
go too overboard here, but since arbitrary environment variables can be
set through Apache, we need to be able to work with that to be
"correct". Hmm ... what are the restrictions on what webservers can put
in their config files? Maybe ASCII-only is reasonable. *shrug*

Maybe more investigation needed here.

(3) I know there are some software projects apparently using unicodize
as a word, but ... *shudder*. Using "code" as an analogy, "unicodify"
would be nicer (nobody uses "codize", I would hope).

(4) As you go through this process, keep a list somewhere of what people
need to do to port existing applications across to using this
functionality. Ideally, the answer would be "not much" and we can cast
from the default encoding to unicode internally where necessary. But I'm
sure there will be some changes required, so keeping a list of things to
watch out for as you go will help people test this for you.
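
The "cast from the default encoding to unicode internally" idea could be
as small as one helper applied wherever byte strings enter the framework.
A sketch only: `to_text` is a hypothetical name, DEFAULT_CHARSET below is
a stand-in for settings.DEFAULT_CHARSET, and `bytes`/`str` are the current
Python spellings of byte strings and unicode.

```python
# Stand-in for settings.DEFAULT_CHARSET.
DEFAULT_CHARSET = "utf-8"


def to_text(value, encoding=DEFAULT_CHARSET):
    """Return value as unicode text, decoding byte strings if needed."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    # Text passes through unchanged; other objects use their str() form.
    return str(value)
```

Existing applications that pass byte strings in the default encoding would
then keep working without changes at the call site.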

Good to see somebody working on this. :-)

Regards,
Malcolm


