Re: [Imdbpy-devel] Another UTF-8 issue

Davide Alberani Mon, 09 Jul 2007 07:29:14 -0700

On Jul 08, Andrew Pendleton <[EMAIL PROTECTED]> wrote:

> I've got a small issue with imdbpy2sql.py.


/me cries thinking about unicode and encoding. ;-)

Short version: in the CVS there's a fix for this HORRIBLE beast. :-)

> psycopg.ProgrammingError: ERROR:  invalid byte sequence for encoding
> "UTF8": 0xc327

Uuuu... this was a NASTY bug... :-)
I admit I suspected everything from server configuration problems
to interference by little green aliens before I spotted the bug:
it was a combination of garbage in the actors.list file, an error
in imdbpy2sql.py, and yet another quirk I had forgotten to
consider when handling utf-8 strings.
And some little green aliens, that's for sure! ;-)

The facts: in the actors.list file there is, in the filmography of
actor "Franck, Kari", this line:
  "Kahdeksan surmanluotia" (1972) (mini)  [Puhuja Helsingistä  - speaker from Helsinki]  <33>

There are no strange chars here, and everything should be handled
correctly by imdbpy2sql.py (the "a" with the dieresis/umlaut is just a
normal char in iso-8859-1, the encoding used by the plain text data files).

BUT... there are two spaces before the "-" in the field used to
store the role of the actor plus optional notes.
imdbpy2sql.py reads the actors.list file line by line and splits
every line using exactly two spaces as the field separator.
So far, not a big deal...
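
To make the splitting step concrete, here is a minimal sketch of it. This is
a hypothetical reconstruction, not the actual imdbpy2sql.py code: the variable
names are made up, and it uses Python 3 text strings for readability.

```python
# Hypothetical reconstruction of the splitting step (not the real
# imdbpy2sql.py code): split the filmography line on exactly two spaces.
line = ('"Kahdeksan surmanluotia" (1972) (mini)  '
        '[Puhuja Helsingistä  - speaker from Helsinki]  <33>')

fields = line.split('  ')

# The stray double space before the "-" cuts the role field in two,
# so the piece starting with "[" no longer ends with "]".
for field in fields:
    print(repr(field))
```

The second field comes out as `[Puhuja Helsingistä`, with the closing bracket
left behind in the third field.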

BUT^2... my code only checked whether a field _started_ with "[" to
identify a role field, without checking that it also ended
with "]"; it just took the field and cut off the first and the
last bytes (the square brackets).
Again, not a big deal...

BUT^3... in this specific case the last char of what is assumed
to be the role of the actor ("Puhuja Helsingistä", with nothing
after it because the line was split at the two spaces that follow)
happens to be "ä".
Internally imdbpy2sql.py handles all text as plain byte strings (and
_not_ as unicode objects) encoded in utf-8.
I.e.:
  unicode_auml = u'\xe4'
  utf8string_auml = unicode_auml.encode('utf8')
utf8string_auml is '\xc3\xa4' (two bytes), which is obviously a valid
representation of a utf8 char.
imdbpy2sql.py, assuming that the last char of a role field
is "]", just cuts the last byte...

BUT^4... in this case what's left is a string ending with '\xc3'
that is _not_ a valid representation of a utf8 char!
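
The effect is easy to reproduce. The snippet below is only an illustration
under assumptions: it is written for Python 3 (where byte strings are `bytes`),
and the variable names are hypothetical rather than taken from imdbpy2sql.py.

```python
# Illustration only (Python 3, hypothetical names): blindly cutting the
# first and last byte of a utf-8 encoded field that ends with "ä".
role_field = '[Puhuja Helsingistä'.encode('utf-8')

# Old behavior: assume the field is "[...]" and drop one byte at each end.
role = role_field[1:-1]

# "ä" is the two bytes 0xc3 0xa4; only the second one was removed.
try:
    role.decode('utf-8')
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

print(repr(role[-1:]), is_valid_utf8)
```

The dangling 0xc3 lead byte is the same one PostgreSQL complains about in the
error message quoted above.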

Solution: instead of cutting the last byte, whatever it was,
now I only strip every "]" at the end of the string.
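
A rough sketch of the idea behind the fix follows. The helper name
`extract_role` is hypothetical and the real change in CVS may look different;
in particular, dropping the leading "[" by slicing is an assumption here.

```python
def extract_role(field: bytes) -> bytes:
    """Hypothetical helper: turn a '[role]' field into 'role'.

    Assumes the caller already checked that the field starts with "[".
    Only trailing "]" bytes are removed, so a field that was cut short
    by the two-space split keeps its last character intact.
    """
    return field[1:].rstrip(b']')


broken = '[Puhuja Helsingistä'.encode('utf-8')   # split too early
normal = '[Himself]'.encode('utf-8')             # well-formed field

print(extract_role(broken).decode('utf-8'))
print(extract_role(normal).decode('utf-8'))
```

Since 0x5d ("]") never appears inside a multi-byte utf-8 sequence, stripping
it from the end of a valid utf-8 byte string cannot leave a truncated char.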

A funny fact is that MySQL accepts such an invalid string without
complaint, so I had never noticed it before.

> This is using Python 2.5, SQLObject 0.7.1, and python-psycopg 1.1.21.
> Would upgrading to newer versions of any of these fix my issue,

Except for Python, these are old releases; anyway the problem, using
postgres, arises even with SQLObject 0.10 and python-psycopg 2.0b8.
Beware that - here I'm only talking about the imdbpy2sql.py script -
postgres is somewhat slow compared to MySQL, so it will take
some time to complete.


Thank you very much for this bug report!  I'll add your name to
the credits.
-- 
Davide Alberani <[EMAIL PROTECTED]> [PGP KeyID: 0x465BFD47]
http://erlug.linux.it/~da/

_______________________________________________
Imdbpy-devel mailing list
Imdbpy-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-devel
