Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8
Since I am not familiar with Python, maybe you could suggest a quick fix so that the script doesn't hang?

Maybe this helps: in PHP we get exactly the same encoding error when importing wrongly decoded data. When we have no control over the data, we can't call utf8_encode every time, since it could encode a string twice. To bypass the error I use this function, which at least prevents the PostgreSQL error:

    function fix_encoding($in_str) {
        $cur_encoding = mb_detect_encoding($in_str);
        if ($cur_encoding == "UTF-8" && mb_check_encoding($in_str, "UTF-8")) {
            return $in_str;
        } else {
            return utf8_encode($in_str);
        }
    }

Maybe you can help adapt this function to Python, if similar functions are available, so we can use it as a quick fix?

Thanks a lot.

On Mon, Apr 11, 2011 at 10:46 PM, Davide Alberani davide.alber...@gmail.com wrote:
> On Mon, Apr 11, 2011 at 18:35, darklow dark...@gmail.com wrote:
>>   File "./imdbpy2sql.py", line 1194, in _toDB
>>     CURS.executemany(self.sqlstr, self.converter(l))
>> psycopg2.DataError: invalid byte sequence for encoding UTF8: 0xc320
>> HINT: This error can also happen if the byte sequence does not match
>> the encoding expected by the server, which is controlled by
>> client_encoding.
>>
>> Any suggestions? I found a similar topic, but there were also no
>> solutions.
>
> Yes, I've had other reports about this bug. Seems to be related to
> some garbage in the actors.list.gz file.
> I hope to have time to investigate the problem within a week or two.
>
> Thanks for the bug report!
>
> --
> Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47]
> http://www.mimante.net/

___
Imdbpy-help mailing list
Imdbpy-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-help
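A rough Python counterpart of the PHP helper quoted in this message might look like the sketch below. It is not part of IMDbPY: the name fix_encoding is simply carried over from the PHP snippet, and the code assumes Python 3 byte strings, whereas imdbpy2sql.py 4.7 targets Python 2. PHP's utf8_encode converts ISO-8859-1 to UTF-8, so the fallback here does the same, which is only a guess about the real encoding of the damaged rows.

```python
def fix_encoding(raw: bytes) -> bytes:
    """Return raw unchanged if it is valid UTF-8, else re-encode it.

    Hypothetical helper mirroring the PHP fix_encoding() above.
    Assumption: anything that is not valid UTF-8 is treated as Latin-1
    (ISO-8859-1), which is what PHP's utf8_encode() assumes as well.
    """
    try:
        raw.decode("utf-8")      # strict decoding raises on invalid bytes
        return raw
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this never fails.
        return raw.decode("latin-1").encode("utf-8")


if __name__ == "__main__":
    print(fix_encoding(b"caf\xc3\xa9"))   # already UTF-8, returned as is
    print(fix_encoding(b"caf\xe9"))       # Latin-1 input, re-encoded
```

The result is always valid UTF-8, so PostgreSQL should accept it, but a wrongly guessed source encoding will still store the wrong characters.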
Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8
Maybe someone knows a quick and dirty fix, at least for skipping such invalid byte sequence strings while there is no official fix, so I can finish the import?

Can we detect invalid byte characters? Maybe we can somehow replace or get rid of the *0xc320* character, which is the one that appears most often. Or skip these rows.

I analyzed the error a bit more. These errors mostly occur for Japanese actors (actors.list), where strange characters appear in the filmography:

    Hayakawa, Yuzo Burai hij*8)* * *

I tried to delete these rows manually, but there are too many of them :/

Thank you.

On Wed, Apr 13, 2011 at 9:45 AM, darklow dark...@gmail.com wrote:
> Since I am not familiar with Python, maybe you could suggest a quick
> fix so that the script doesn't hang?
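The two options raised in this message (skipping the bad rows, or dropping only the bad bytes) could be sketched roughly as follows. This is an illustration under assumptions, not code from imdbpy2sql.py: the function names are made up, the rows are assumed to be available as raw byte strings, and the sketch is written for Python 3.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def split_rows(rows):
    """Separate rows into (valid, invalid) lists so the bad ones can be skipped."""
    valid, invalid = [], []
    for row in rows:
        (valid if is_valid_utf8(row) else invalid).append(row)
    return valid, invalid


def strip_invalid(raw: bytes) -> bytes:
    """Drop only the bytes that cannot be decoded, keeping the rest of the row."""
    return raw.decode("utf-8", errors="ignore").encode("utf-8")


if __name__ == "__main__":
    rows = [b"Hayakawa, Yuzo", b"Burai\xc3 x"]   # second row contains C3 20
    print(split_rows(rows))
    print(strip_invalid(rows[1]))
```

Skipping rows loses the whole record, while stripping keeps it with a character missing; which is preferable depends on how much of the data is affected.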
Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8
On Mon, Apr 11, 2011 at 18:35, darklow dark...@gmail.com wrote:
>   File "./imdbpy2sql.py", line 1194, in _toDB
>     CURS.executemany(self.sqlstr, self.converter(l))
> psycopg2.DataError: invalid byte sequence for encoding UTF8: 0xc320
> HINT: This error can also happen if the byte sequence does not match
> the encoding expected by the server, which is controlled by
> client_encoding.

Hi all,
I'm writing regarding the recent 0xc320 problem with IMDbPY.

The above notice is extremely interesting, and should be investigated: how can it be that 0xc320 is not UTF8 encodable? It should work; from the Python prompt:

    >>> unichr(0xc320).encode('utf8')
    '\xec\x8c\xa0'

Anyway, as a very quick and dirty fix (the main problem is probably some crap in the data files), try this: after line 1181 of imdbpy2sql.py, add:

    k = k.replace('\xec\x8c\xa0', '')

So that the nearby lines will become:

    try:
        k = k.replace('\xec\x8c\xa0', '')
        t = analyze_name(k)
    except IMDbParserError:

Please be aware that this fix was not tested at all, but I'm almost sure that, at the above point, 'k' is a string encoded in utf8.

Anyway, besides the garbage theory, I have another idea about the source of the error, but I have to verify it later...

Bye, and let me know if it works!

--
Davide Alberani davide.alber...@gmail.com [PGP KeyID: 0x465BFD47]
http://www.mimante.net/
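One possible reading of the PostgreSQL message, offered here as an assumption rather than a confirmed diagnosis of this bug, is that 0xc320 names the two raw bytes C3 20 and not the code point U+C320. Under that reading the replace() above would not match the offending data, because the byte 0xC3 starts a two-byte UTF-8 sequence and a space (0x20) is not a valid continuation byte. The sketch below (Python 3, with a hypothetical helper name) shows the difference and how to locate the first undecodable byte.

```python
def first_utf8_error(raw: bytes):
    """Return (offset, offending bytes) of the first UTF-8 decoding error,
    or None if raw is valid UTF-8. Hypothetical helper for illustration."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start, raw[exc.start:exc.end]
    return None


if __name__ == "__main__":
    # The code point U+C320 encodes fine, as the session above shows.
    as_code_point = chr(0xC320).encode("utf-8")
    print(as_code_point, first_utf8_error(as_code_point))

    # The byte pair C3 20 inside a row is what a strict decoder rejects.
    row = b"Yuzo \xc3 x"
    print(first_utf8_error(row))
```

If this reading is right, a byte-level cleanup of the pair C3 20 (or of any undecodable byte) is closer to what the server is complaining about than removing the encoded form of U+C320.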