Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8

2011-04-13 Thread darklow
Since I am not familiar with Python, maybe you could suggest a quick fix
so that the script doesn't hang?
Maybe this helps: in PHP we get exactly the same encoding error when
importing wrongly decoded data. We have no control over the data, and we
can't call utf8_encode every time, since it could encode a string twice. To
bypass the error I use this function, which at least prevents the
PostgreSQL error:

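function fix_encoding($in_str) {
    // Leave the string alone if it is already valid UTF-8.
    $cur_encoding = mb_detect_encoding($in_str);
    if ($cur_encoding == 'UTF-8' && mb_check_encoding($in_str, 'UTF-8')) {
        return $in_str;
    } else {
        // Otherwise treat the input as ISO-8859-1 and convert it to UTF-8.
        return utf8_encode($in_str);
    }
}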

Maybe you can help to adapt this function to Python if similar functions are
available so we can use it as a quick fix?
Thanks a lot.
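
A rough Python 3 counterpart of the PHP function above could look like the
sketch below. It is only an illustration, not part of imdbpy2sql.py: it
assumes the input is a byte string, and it assumes that anything that is not
valid UTF-8 is ISO-8859-1 (Latin-1), which is the same assumption PHP's
utf8_encode makes.

```python
def fix_encoding(raw: bytes) -> bytes:
    """Return raw unchanged if it is valid UTF-8, else re-encode it.

    Assumption: bytes that fail UTF-8 validation are Latin-1 text,
    mirroring what PHP's utf8_encode does.
    """
    try:
        # Decoding is the validity check; the result is discarded.
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this cannot fail.
        return raw.decode("latin-1").encode("utf-8")
```

The result is always valid UTF-8, but if the garbage was never Latin-1 in the
first place the stored text will still be wrong, just no longer rejected.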

On Mon, Apr 11, 2011 at 10:46 PM, Davide Alberani davide.alber...@gmail.com
 wrote:

 On Mon, Apr 11, 2011 at 18:35, darklow dark...@gmail.com wrote:
 
   File ./imdbpy2sql.py, line 1194, in _toDB
     CURS.executemany(self.sqlstr, self.converter(l))
 psycopg2.DataError: invalid byte sequence for encoding UTF8: 0xc320
 HINT:  This error can also happen if the byte sequence does not match the
 encoding expected by the server, which is controlled by client_encoding.
 
  Any suggestions? I found a similar topic, but it had no solution either.

 Yes, I've had other reports about this bug.
 Seems to be related to some garbage in the actors.list.gz file.
 I hope to have time to investigate the problem within a week or two.

 Thanks for the bug report!

 --
 Davide Alberani davide.alber...@gmail.com  [PGP KeyID: 0x465BFD47]
 http://www.mimante.net/

--
Imdbpy-help mailing list
Imdbpy-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/imdbpy-help


Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8

2011-04-13 Thread darklow
Maybe someone knows a quick and dirty fix, at least a way to skip strings
with such invalid byte sequences until there is an official fix, so I can
finish the import?
Can we detect invalid bytes? Maybe we can somehow replace or get rid of the
*0xc320* sequence, which is the one that appears most often. Or skip these rows.
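
One way to detect and skip such rows, sketched in Python 3: try to decode each
raw row as UTF-8 and set aside the ones that fail. The function name
partition_rows and the sample rows are made up for this illustration; they are
not taken from imdbpy2sql.py or from actors.list.

```python
from typing import Iterable, List, Tuple


def partition_rows(rows: Iterable[bytes]) -> Tuple[List[str], List[Tuple[int, bytes]]]:
    """Split raw rows into decoded valid rows and (index, raw) invalid rows."""
    good: List[str] = []
    bad: List[Tuple[int, bytes]] = []
    for index, row in enumerate(rows):
        try:
            good.append(row.decode("utf-8"))
        except UnicodeDecodeError:
            # Keep the position and raw bytes so skipped rows can be reviewed.
            bad.append((index, row))
    return good, bad


# Hypothetical sample: the middle row contains the bytes 0xC3 0x20.
sample = [b"first row", b"broken \xc3 row", "third r\u00f6w".encode("utf-8")]
good, bad = partition_rows(sample)
```

Keeping the rejected rows rather than dropping them silently makes it possible
to see afterwards how much data the import skipped.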

I analyzed the error a bit more. These errors mostly occur for Japanese
actors (actors.list), where strange characters appear in the filmography:

Hayakawa, Yuzo

Burai hij*8)*

I tried to delete these rows manually, but there are too many of them :/
Thank you.




Re: [Imdbpy-help] imdbpy2sql 4.7 - invalid byte sequence for encoding UTF8

2011-04-13 Thread Davide Alberani
On Mon, Apr 11, 2011 at 18:35, darklow dark...@gmail.com wrote:

   File ./imdbpy2sql.py, line 1194, in _toDB
     CURS.executemany(self.sqlstr, self.converter(l))
 psycopg2.DataError: invalid byte sequence for encoding UTF8: 0xc320
 HINT:  This error can also happen if the byte sequence does not match the
 encoding expected by the server, which is controlled by client_encoding.

Hi all,
I'm writing regarding the recent 0xc320 problem with IMDbPY.
The above notice is extremely interesting, and should be investigated:
how can it be that 0xc320 is not UTF8 encodable?
It should work; from the Python prompt:
>>> unichr(0xc320).encode('utf8')
'\xec\x8c\xa0'
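
A possible explanation, offered as a reading of the error message rather than
a confirmed diagnosis: PostgreSQL reports the offending bytes, not a code
point, so 0xc320 would mean the byte 0xC3 followed by 0x20 (a space). 0xC3
opens a two-byte UTF-8 sequence and must be followed by a continuation byte in
the range 0x80-0xBF, which a space is not. The Python 3 sketch below contrasts
the two readings; is_valid_utf8 is a helper name invented for the example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Reading 1: the two raw bytes named in the error message.
reported_bytes = bytes([0xC3, 0x20])

# Reading 2: the code point U+C320 encoded as UTF-8 (chr is the
# Python 3 spelling of unichr).
encoded_code_point = chr(0xC320).encode("utf-8")

bytes_are_valid = is_valid_utf8(reported_bytes)
code_point_is_valid = is_valid_utf8(encoded_code_point)
```

Under this reading the code point U+C320 encodes without trouble, while the
two raw bytes are exactly what a UTF-8 decoder rejects.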

Anyway, as a very fast and dirty fix (the main problem is probably some
crap in the data files), try this: after line 1181 of imdbpy2sql.py, add:
  k = k.replace('\xec\x8c\xa0', '')

So that the nearby lines will become:
    try:
        k = k.replace('\xec\x8c\xa0', '')
        t = analyze_name(k)
    except IMDbParserError:

Please be aware that this fix was not tested at all, but I'm
almost sure that, at the above point, 'k' is a string encoded in utf8.
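
If the 0xc320 in the error message names two raw bytes (0xC3 followed by a
space) rather than a code point, replacing '\xec\x8c\xa0' would leave them
untouched. A variant that does not hard-code any byte sequence is sketched
below in Python 3. It is untested against imdbpy2sql.py, the helper name
drop_invalid_utf8 is invented here, and it assumes 'k' is a byte string that
is meant to be UTF-8.

```python
def drop_invalid_utf8(k: bytes) -> bytes:
    """Return k with every byte that is not part of valid UTF-8 removed."""
    # errors="ignore" skips undecodable bytes instead of raising.
    return k.decode("utf-8", errors="ignore").encode("utf-8")


# Hypothetical sample containing the bytes 0xC3 0x20.
dirty = b"abc\xc3 def"

# Replacing the encoded code point U+C320 does not touch these bytes...
after_replace = dirty.replace(b"\xec\x8c\xa0", b"")
# ...while the decode/encode round trip drops the stray 0xC3.
after_cleanup = drop_invalid_utf8(dirty)
```

This silently discards data, so it only makes sense as a stopgap for
finishing an import, in the same spirit as the quick fix above.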

Anyway, besides the garbage theory, I have another idea
about the source of the error, but I have to verify it later...

Bye, and let me know if it works!

-- 
Davide Alberani davide.alber...@gmail.com  [PGP KeyID: 0x465BFD47]
http://www.mimante.net/
