Re: unicode issues in multiple tickets (#952, #1356, #3370) and thread about Euro sign in django-users

Bjørn Stabell Sat, 27 Jan 2007 20:46:32 -0800

On Jan 28, 4:03 am, "ak" <[EMAIL PROTECTED]> wrote:
> After some thoughts I came to the following conclusion: if you guys
> want to keep support of legacy charsets in fact you don't have to
> force model objects to be unicoded. Firstly, they are passed to
> templates and filters and we can't mix legacy charsets with unicode in
> one template. Next, if I don't use unicode, I don't have to code my
> python sources (views) in unicode. So, I need to be able to pass
> string values into my model objects and my strings are not unicoded.
>
> So if everyone agreed, the way is simple:
> 1. when django loads data from db and fills in a model object, all
> strings have to be encoded according to DEFAULT_CHARSET
> 2. when django passes data from form object to model object, it has to
> encode strings according to DEFAULT_CHARSET again


This is quite confusing.  It seems you're advocating decoding/encoding 
multiple times.  Being a Norwegian involved in web development in 
China, I love Unicode, and I've been fighting with it for 6-7 years.  
This is what I've learned:

1) Unicode != external character encoding.  Unicode-aware languages 
have an internal Unicode representation, and all code that needs to 
understand the concept of a "character" deals with this; e.g., 
lowercasing, sorting.  You never worry what this representation is 
(you're assuming too much about the programming language if you do).  
Instead you:

  decode from a character encoding (e.g., UTF-8, ISO8859-1, GB18030) 
    into this representation
  encode this internal representation into a character encoding

UTF-8 and UTF-16 are character encodings.  GB18030 is a Chinese 
character encoding that, like UTF-8 and UTF-16, can represent every 
code point in the Unicode standard.  Older encodings are usually 
language/locale specific, so they can only represent a small subset 
of the code points (characters) in Unicode.
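
To make the decode/encode distinction concrete, here is a minimal sketch.  
It assumes Python 3, where str is the internal representation and bytes 
holds encoded data; the sample string and variable names are made up for 
illustration.

```python
# Python 3 sketch: str is the internal Unicode representation,
# bytes is data in some external character encoding.

# Pretend these bytes arrived from outside (a form, a file, a database).
raw_utf8 = "Bjørn pays 5 €".encode("utf-8")

# Decode into the internal representation.
text = raw_utf8.decode("utf-8")

# Character-level work (uppercasing here) happens on the decoded text.
shouted = text.upper()

# GB18030 can represent every Unicode code point, so this round-trips.
as_gb18030 = shouted.encode("gb18030")
round_trip = as_gb18030.decode("gb18030")

# ISO8859-1 is a legacy encoding with no Euro sign, so encoding fails.
try:
    shouted.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
```

The program never needs to know how str is stored internally; it only 
names an encoding when bytes come in or go out.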

I'm not sure what "unicoding" or "unicodifying" means.  Is it decoding 
into the internal unicode representation, or the process of making 
your code unicode aware and compatible?

Joel has a nicely written intro: 
http://www.joelonsoftware.com/articles/Unicode.html

2) Unicode is an all-or-nothing thing (not obvious).  If you try to 
use it partly, sometimes, or only somewhere, you'll end up with 
UnicodeErrors popping up everywhere and a very inefficient 
architecture with multiple encodings/decodings happening during each 
request...  Oh this module doesn't do Unicode, better give it UTF-8, 
but then it has to pass something back, which should be of type 
unicode, but it doesn't know which character encoding we're using so 
then I have to pass that to it, ... ad nauseam.
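
A small sketch of what goes wrong when one component hands encoded bytes 
to another that guesses the charset.  This assumes Python 3; the payload 
and the two "guesses" are hypothetical, chosen only to show the two 
failure modes.

```python
# Bytes produced by one part of the system, encoded as UTF-8.
payload = "prix: 5 €".encode("utf-8")

# A component that wrongly assumes ISO8859-1 gets no error at all,
# just silently corrupted text (mojibake).
guessed_latin1 = payload.decode("iso-8859-1")
silently_wrong = guessed_latin1 != "prix: 5 €"

# A component that assumes plain ASCII fails loudly instead.
try:
    payload.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Only the component that is told the real encoding recovers the text.
correct = payload.decode("utf-8")
```

Every bytes boundary inside the application forces this kind of guessing 
or bookkeeping, which is the argument for going all the way.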

3) Doing Unicode is (I think) worthwhile, but it is a tradeoff: 
everyone suddenly has to understand and deal with character encoding 
issues, and there's a slight performance penalty.  It's practically 
impossible to have Unicode without making these tradeoffs.  (That 
said, many environments have made these tradeoffs successfully, e.g., 
Java, C#.)  Only doing decoding/encoding at the I/O edges reduces the 
pain, however.
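
As a rough illustration of decoding/encoding only at the I/O edges, here 
is a sketch in Python 3.  The names handle_request and process are 
hypothetical, not Django API; the charset argument stands in for 
something like DEFAULT_CHARSET.

```python
def process(text: str) -> str:
    """Application logic: works purely on decoded text, no charsets here."""
    return text.strip().upper()


def handle_request(raw: bytes, charset: str = "utf-8") -> bytes:
    """Hypothetical request handler that touches encodings only at the edges."""
    text = raw.decode(charset)        # input edge: bytes -> text, once
    result = process(text)            # everything in between sees only text
    return result.encode(charset)     # output edge: text -> bytes, once


# The same logic serves clients using different external encodings.
utf8_reply = handle_request(" bjørn ".encode("utf-8"))
gb_reply = handle_request("你好 bjørn".encode("gb18030"), charset="gb18030")
```

Only handle_request knows about charsets, so there is exactly one decode 
and one encode per request.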

Rgds,
Bjorn

