On Jul 08, Andrew Pendleton <[EMAIL PROTECTED]> wrote:

> I've got a small issue with imdbpy2sql.py.

/me cries thinking about unicode and encoding. ;-)
Short version: there's a fix in CVS for this HORRIBLE beast. :-)

> psycopg.ProgrammingError: ERROR: invalid byte sequence for encoding
> "UTF8": 0xc327

Uuuu... this was a NASTY bug... :-)
I admit I considered everything from server configuration problems to
interference by little green aliens before I spotted the bug: it was a
combination of garbage in the actors.list file, an error in
imdbpy2sql.py, and yet another quirk of handling utf-8 strings that I
had forgotten to consider.
And some little green aliens, that's for sure! ;-)

The facts: in the actors.list file there is, in the filmography of
actor "Franck, Kari", this line:
  "Kahdeksan surmanluotia" (1972) (mini) [Puhuja Helsingistä  - speaker from Helsinki] <33>

There are no strange chars, and everything should be handled correctly
by imdbpy2sql.py (the "a" with the dieresis/umlaut is just a normal
char in iso-8859-1, the encoding used by the plain text data files).

BUT... there are two spaces before the "-" in the field used to store
the role of the actor plus optional notes.
imdbpy2sql.py reads the actors.list file line by line and splits every
line using exactly two spaces as the field separator.
So far, not a big deal...

BUT^2... my code only checked whether a field _started_ with "[" to
identify a role field, without checking that it also ended with "]";
it just took the field and cut the first and the last bytes (the
square brackets).
Again, not a big deal...

BUT^3... in this specific case the last char of what is assumed to be
the role of the actor ("Puhuja Helsingistä", with nothing after it,
because the line was split at the two spaces that follow) happens to
be "ä".
Internally imdbpy2sql.py handles all text as plain strings (and _not_
as unicode) encoded in utf-8. I.e.:
  unicode_auml = u'\xe4'
  utf8string_auml = unicode_auml.encode('utf8')

utf8string_auml is '\xc3\xa4' (two bytes), and obviously a valid
representation of a utf8 char.

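For the curious, here is a minimal sketch of the steps described so far. It is written for Python 3, with bytes objects standing in for the plain Python 2 strings the script uses. The helper names (split_fields, naive_role, is_valid_utf8) and the two-space separators in the sample line are assumptions made for illustration; this is not the actual imdbpy2sql.py code.

```python
# Sample line reconstructed from the mail; the two-space field
# separators are an assumption, "\xe4" is the "a" with dieresis.
LINE = ('"Kahdeksan surmanluotia" (1972) (mini)  '
        '[Puhuja Helsingist\xe4  - speaker from Helsinki]  <33>')


def split_fields(raw):
    """Split a utf-8 encoded line on exactly two spaces (hypothetical helper)."""
    return [field for field in raw.split(b"  ") if field]


def naive_role(fields):
    """Mimic the behavior described above: take the first field that
    starts with "[" and drop its first and last byte, whatever they are."""
    for field in fields:
        if field.startswith(b"["):
            return field[1:-1]
    return None


def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as utf-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


raw = LINE.encode("utf-8")
fields = split_fields(raw)
role = naive_role(fields)

print(fields[1])            # b'[Puhuja Helsingist\xc3\xa4'
print(role)                 # b'Puhuja Helsingist\xc3'
print(is_valid_utf8(role))  # False
```

The second field ends with the two bytes of the encoded "ä", so a cut that assumes a one-byte "]" at the end removes only half of that character.
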
imdbpy2sql.py, assuming that the last char of a role field is "]",
just cuts the last byte...

BUT^4... in this case what's left is a string ending with '\xc3',
which is _not_ a valid representation of a utf8 char!

Solution: instead of cutting the last byte, whatever it is, I now
only strip any "]" at the end of the string.

A funny fact is that the insertion of such an invalid string is handled
gracefully by MySQL, so I had never noticed it before.

> This is using Python 2.5, SQLObject 0.7.1, and python-psycopg 1.1.21.
> Would upgrading to newer versions of any of these fix my issue,

Except for Python, these are old releases; anyway, with postgres the
problem arises even with SQLObject 0.10 and python-psycopg 2.0b8.

Beware that (and here I'm only talking about the imdbpy2sql.py script)
postgres is somewhat slow compared to MySQL, so it will take some time
to complete.

Thank you very much for this bug report! I'll add your name to the
credits.

-- 
Davide Alberani <[EMAIL PROTECTED]> [PGP KeyID: 0x465BFD47]
http://erlug.linux.it/~da/