Re: Character encoding... latin1 to utf8?

Malcolm Tredinnick Thu, 04 Dec 2008 18:13:19 -0800


On Thu, 2008-12-04 at 16:34 -0800, Rob Hudson wrote:
> I'm migrating a site to Django.  The old site was PHP/MySQL with MySQL
> having a default encoding of latin1.  It seems like there are also
> Windows 1252 encodings but I'm not sure.
> 
> I have the old database and the new Django UTF8 one side by side and
> have a migration script that uses raw MySQLdb to connect to the old,
> and Django's ORM to connect to the new.  Is there anything I can do to
> ensure the data going into the new is UTF8?


Since this is presumably a once-off conversion operation, I'd make sure
that the data coming from the source database was converted into Python
unicode objects before passing it to Django. That way, any errors will
be caught.

Now you might well be able to have this happen automatically using the
"use_unicode" option to MySQLdb (usually together with "charset") -- it
knows how to convert between various server-side encodings and Python
unicode. So look at those parameters to the connect() call. It's fairly
well done in MySQLdb (it and PostgreSQL were almost trivial to make work
when we added Unicode support to Django).
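
As a rough sketch of what that could look like: use_unicode and charset
are the MySQLdb.connect() keyword arguments that matter here. The host,
user, password and database values below are made-up placeholders, and
charset="latin1" assumes the old server really is on its latin1 default.

```python
# Keyword arguments for the connection to the OLD database.
# host/user/passwd/db are hypothetical placeholders, not real values.
OLD_DB_SETTINGS = {
    "host": "localhost",
    "user": "olduser",
    "passwd": "secret",
    "db": "oldsite",
    "charset": "latin1",   # assumed encoding of the old data
    "use_unicode": True,   # return text columns as unicode objects
}


def connect_old_db(settings=OLD_DB_SETTINGS):
    """Open a connection that hands back unicode instead of bytestrings."""
    import MySQLdb  # imported lazily so the settings can be inspected without it
    return MySQLdb.connect(**settings)
```

If the values coming back from a cursor on that connection are already
unicode objects, no manual decode() step should be needed.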

Alternatively, if you're getting bytestrings back, run them through a
decode() call:

        data = original_data.decode('cp1252')
        
cp1252 matches latin1 (iso-8859-1) everywhere except bytes 0x80-0x9f,
where latin1 only has rarely used control characters and cp1252 puts
curly quotes, dashes and the like, so you can specify cp1252 for both.
Real-world latin1 text almost never contains anything in that range.
Once your data is in Unicode, passing it to Django's ORM will Just
Work(tm). However, I'd definitely call this Plan B and see if passing
the use_unicode=True option to MySQLdb.connect() works, since that might
just be a one-line solution.

> To further complicate things, once the data is in the new UTF8
> database, I have a script that exports to a CSV file for a client to
> use a subset of the data.  And right now this is all sorts of fail for
> me.  I tried using the Django snippet here: 
> http://www.djangosnippets.org/snippets/993/
> but am essentially getting what the first commenter says unless I
> import as Windows 1252 — then the boxes turn into quotes and
> apostrophes that look right.

I can't help there. It sounds like Excel on Windows is ignoring the fact
that the data is UTF-16 and treating it as cp1252, which is, of course,
totally broken. I don't completely understand that fragment, but the
bits that are confusing to me (it writes to both a writer and a stream,
for example) are probably because I haven't opened up the csv writer
class to see what should be subclassed.

Just for laughs, though, try running "file" on the csv file you generate
and make sure it, at least, detects that it is a UTF-16 file.
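
If "file" isn't available (on Windows, say), a crude stand-in is to look
at the first two bytes yourself. This is only a sketch: the function
names are made up, and it checks for a UTF-16 byte order mark and
nothing else, so it can't tell UTF-16 from a UTF-32 file that happens
to start with the same two bytes.

```python
import codecs


def sniff_utf16_bom(first_bytes):
    """Return the byte order implied by a leading UTF-16 BOM, or None."""
    if first_bytes.startswith(codecs.BOM_UTF16_LE):
        return "little-endian"
    if first_bytes.startswith(codecs.BOM_UTF16_BE):
        return "big-endian"
    return None


def sniff_file(path):
    """Check the first two bytes of the file at `path`."""
    with open(path, "rb") as handle:
        return sniff_utf16_bom(handle.read(2))
```

A result of None for the generated CSV would suggest the file has no
BOM, which is one reason a spreadsheet might fall back to guessing
cp1252.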

> Character encodings are a big confusion for me.

Working with things through Django should be relatively straightforward.
Django will give you Unicode strings (type "unicode"). You call the
encode() method to convert it to whichever encoding you like (unicode
objects on their own can't be written out -- you need to pick an
encoding). If your target requires UTF-16, you need to start off with a
byte order mark (BOM) to indicate whether the two-byte units of the
output are in little-endian or big-endian order. If you
call .encode('utf-16'), Python writes out the BOM for you (the
'\xff\xfe' sequence at the start, on a little-endian machine).
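
To make the encode() step concrete, here is a small sketch with a
made-up sample string. It shows that the generic 'utf-16' codec adds
the BOM, while the codecs with an explicit byte order, such as
'utf-16-le', do not.

```python
import codecs

text = u"caf\xe9"  # a unicode string, as Django would hand it to you

utf8_bytes = text.encode("utf-8")          # no BOM
utf16_bytes = text.encode("utf-16")        # BOM + native byte order
utf16le_bytes = text.encode("utf-16-le")   # explicit byte order, no BOM

# The generic 'utf-16' codec starts with the BOM for this machine.
has_bom = utf16_bytes.startswith(codecs.BOM_UTF16)
```

So if you write a UTF-16 file piece by piece, encoding each chunk with
'utf-16' would repeat the BOM in every chunk. Writing the BOM once and
then using an explicit-endian codec avoids that.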


Regards,
Malcolm


