On Thu, 2008-12-04 at 16:34 -0800, Rob Hudson wrote: > I'm migrating a site to Django. The old site was PHP/MySQL with MySQL > having a default encoding of latin1. It seems like there are also > Windows 1252 encodings but I'm not sure. > > I have the old database and the new Django UTF8 one side by side and > have a migration script that uses raw MySQLdb to connect to the old, > and Django's ORM to connect to the new. Is there anything I can do to > ensure the data going into the new is UTF8?
Since this is presumably a once-off conversion operation, I'd make sure that the data coming from the source database was converted into Python unicode objects before passing it to Django. That way, any errors will be caught. Now you might well be able to have this happen automatically using the "unicode" option to MySQLdb -- it knows how to convert between various server-side encodings and Python unicode. So look at that parameter to the connect() call. It's fairly well done in MySQLdb (it and PostgreSQL were almost trivial to make work when we added Unicode support to Django). Alternatively, if you're getting bytestrings backs, run them through a decode() call: data = original_data.decode('cp1252') Since cp1252 is a superset of latin1 (iso-8859-1), you can specify the same encoding for both. Valid latin1 won't contain any data in the extra codepoints used by cp1252. Once your data is in Unicode, passing it to Django's ORM will Just Work(tm). However, I'd definitely call this Plan B and see if passing the unicode=True option to MySQLdb.connect() works, since that might just be a one-line solution. > To further complicate things, once the data is in the new UTF8 > database, I have a script that exports to a CSV file for a client to > use a subset of the data. And right now this is all sorts of fail for > me. I tried using the Django snippet here: > http://www.djangosnippets.org/snippets/993/ > but am essentially getting what the first commenter says unless I > import as Windows 1252 — then the boxes turn into quotes and > apostrophes that look right. I can't help there. It sounds like Excel on Windows is ignoring the fact that the data is UTF-16 and treating it as cp1252, which is, of course, totally broken. I don't completely understand that fragment, but the bits that are confusing to me (it writes to both a writer and a stream, for example) are probably because I haven't opened up the csv writer class to see what should be subclassed. Just for laughs, though, try running "file" on the csv file you generate and make sure it, at least, detects that it is a UTF-16 file. > Character encodings are a big confusion for me. Working with things through Django should be relatively straightforward. Django will give you Unicode strings (type "unicode"). You call the encode() method to convert it to whichever encoding you like (unicode objects on their own can't be written out -- you need to pick an encoding). If your target requires UTF-16, you need to start off with a byte order mark (BOM) to indicate whether the two-byte output of each character is in little-endian or big-endian order. If you call .encode('utf-16'), Python writes out the BOM for you (the '\xff \xfe' sequence at the start). Regards, Malcolm --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "Django users" group. To post to this group, send email to django-users@googlegroups.com To unsubscribe from this group, send email to [EMAIL PROTECTED] For more options, visit this group at http://groups.google.com/group/django-users?hl=en -~----------~----~----~----~------~----~------~--~---