Re: Character encoding... latin1 to utf8?

Rob Hudson Sat, 06 Dec 2008 08:11:12 -0800

Thanks Malcolm,

On Dec 4, 6:12 pm, Malcolm Tredinnick <[EMAIL PROTECTED]>
wrote:
> Now you might well be able to have this happen automatically using the
> "unicode" option to MySQLdb -- it knows how to convert between various
> server-side encodings and Python unicode. So look at that parameter to
> the connect() call. It's fairly well done in MySQLdb (it and PostgreSQL
> were almost trivial to make work when we added Unicode support to
> Django).


I actually had that set up already.  I'm trying to look at it a little
more closely.  Here's a dpaste of a SQL call and a few columns.  Look
at the "fdescr" column output... it's showing the string is unicode
but it has some characters in it like \x95 and \x92.
http://dpaste.com/96601/
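
One plausible explanation for those characters, sketched here in Python 3 syntax (the session in this thread is Python 2): the column holds cp1252 bytes, but they are being decoded as latin1, which maps every byte to the code point with the same number. The sample bytes below are made up for illustration and are not taken from the dpaste.

```python
# Bytes as a Windows client might have stored them: in cp1252,
# 0x95 is a bullet and 0x92 is a right single quotation mark.
raw = b"\x95how to take a rod apart when it\x92s stuck"

# latin-1 maps each byte to the code point with the same value,
# so 0x95 and 0x92 come out as the control characters U+0095 and U+0092.
as_latin1 = raw.decode("latin-1")

# cp1252 maps the same bytes to the intended punctuation.
as_cp1252 = raw.decode("cp1252")

print(ascii(as_latin1))
print(ascii(as_cp1252))
```

Both results are unicode strings, which would explain why the type looks right while the content does not.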

> Alternatively, if you're getting bytestrings back, run them through a
> decode() call:
>
>         data = original_data.decode('cp1252')

I tried this at the bottom of the above dpaste just to see, even though
I know I'm not getting bytestrings back there.  So I also tried it
without the unicode=True flag to connect(), and it produces different
output from the above:

>>> row['fdescr'].decode('cp1252')
u'Lefty Kreh is one of the most experienced, well-prepared, and
thoughtful anglers in the world. In <i>101 Fly-Fishing Tips</i>, he
shares this wealth of experience with a variety of common-sense
solutions to the problems that anglers face. Included are tips on:
<br /> \u2022how to pacify a fish<br /> \u2022which hook-sharpening tools
to use and when<br /> \u2022how to take a rod apart when it\u2019s
stuck<br /> \u2022what to do when a fish runs under your boat<br />
\u2022how to dry waders and find leaks<br /> \u2022why long hat brims
affect casting accuracy<br /> \u2022and much more<br /><br />Sure to
improve a fly fisher\u2019s success, comfort, and enjoyment while on
the water. A must for any angler.<br /><br /><b>ABOUT THE AUTHOR</b>
<br />Lefty Kreh is an internationally known and respected master in
the field of fly fishing, and the author of numerous articles and
books on the subject. He lives in Maryland.'

Now instead of \x95 I get \u2022 (which is a bullet).

From here I'm not sure what the best way to proceed is... do I want
the \u2022 version instead, in which case, should I not pass in
unicode=True and manually decode each column?
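
For reference, the "decode each column manually" route could look roughly like this. It is only a sketch in Python 3 syntax: decode_row is a hypothetical helper, the sample row is invented, and it assumes rows come back as dicts whose text columns are bytestrings encoded as cp1252.

```python
def decode_row(row, encoding="cp1252"):
    """Return a copy of row with every bytestring value decoded to text.

    Non-bytes values (ints, None, dates, ...) are passed through untouched.
    """
    return {
        key: value.decode(encoding) if isinstance(value, bytes) else value
        for key, value in row.items()
    }


# Invented sample row standing in for one fetched without unicode conversion.
row = {"id": 7, "fdescr": b"\x95and much more"}
print(ascii(decode_row(row)["fdescr"]))
```

The trade-off is that every query result has to pass through the helper, whereas letting the driver decode keeps that in one place.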

I'm partly thinking that since this is a one-time operation (actually,
one we'll repeat several times until we're ready to switch over to the
new site), I could scan for any "\x" characters and manually replace
them.  There are likely only a handful, as in the above.  But how does
one scan for and replace these so the output is correct?
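
One way such a scan-and-replace might be done, as a sketch in Python 3 syntax: build a table for the code points U+0080 to U+009F (the ones that show up as "\x.." in a repr) and map each to the character cp1252 assigns to the same byte value. The names C1_TO_CP1252 and fix_c1 are made up for this example, and it assumes the stray characters really did start life as cp1252 bytes.

```python
def _build_c1_map():
    """Map U+0080..U+009F to the character cp1252 assigns to that byte."""
    table = {}
    for code in range(0x80, 0xA0):
        try:
            table[code] = bytes([code]).decode("cp1252")
        except UnicodeDecodeError:
            # A few byte values are undefined in cp1252; leave those
            # characters as they are rather than guess.
            pass
    return table


C1_TO_CP1252 = _build_c1_map()


def fix_c1(text):
    """Replace stray C1 control characters with their cp1252 equivalents."""
    return text.translate(C1_TO_CP1252)


print(ascii(fix_c1("\x95how to take a rod apart when it\x92s stuck")))
```

Characters outside that range are left alone, so running it over text that is already correct should be harmless.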

> Just for laughs, though, try running "file" on the csv file you generate
> and make sure it, at least, detects that it is a UTF-16 file.

It actually tells me nothing...
> file export.csv
export.csv:
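
Since "file" says nothing useful here, another rough check is to look at the first few bytes of the export for a byte order mark. This is a sketch in Python 3 syntax; sniff_bom is a hypothetical helper, and a file with no BOM can still be perfectly valid UTF-8 or UTF-16, so a result of None proves little.

```python
import codecs


def sniff_bom(head):
    """Return a codec name if head starts with a known BOM, else None."""
    # Longer BOMs are checked first: the UTF-32 LE BOM begins with the
    # same two bytes as the UTF-16 LE BOM.
    candidates = (
        ("utf-32-le", codecs.BOM_UTF32_LE),
        ("utf-32-be", codecs.BOM_UTF32_BE),
        ("utf-8-sig", codecs.BOM_UTF8),
        ("utf-16-le", codecs.BOM_UTF16_LE),
        ("utf-16-be", codecs.BOM_UTF16_BE),
    )
    for name, bom in candidates:
        if head.startswith(bom):
            return name
    return None


# Usage against the export would be along these lines:
#     with open("export.csv", "rb") as f:
#         print(sniff_bom(f.read(4)))
```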

Thanks,
Rob
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To post to this group, send email to django-users@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/django-users?hl=en
-~----------~----~----~----~------~----~------~--~---
