Jacob Kruger wrote:
> I'm busy using something like pyodbc to pull data out of MS access .mdb
> files, and then generate .sql script files to execute against MySQL
> databases using MySQLdb module, but, issue is forms of characters in
> string values that don't fit inside the 0-127 range - current one seems to
> be something like \xa3, and if I pass it through ord() function, it comes
> out as character number 163.
>
> Now issue is, yes, could just run through the hundreds of thousands of
> characters in these resulting strings, and strip out any that are not
> within the basic 0-127 range, but, that could result in corrupting data -
> think so anyway.
>
> Anyway, issue is, for example, if I try something like
> str('\xa3').encode('utf-8') or str('\xa3').encode('ascii'), or
"\xa3" already is a str; str("\xa3") is as redundant as
str(str(str("\xa3"))) ;)
> str('\xa3').encode('latin7') - that last one is actually our preferred
> encoding for the MySQL database - they all just tell me they can't work
> with a character out of range.
encode() goes from unicode to byte; you want to convert bytes to unicode and
thus need decode().
In this context it is important that you tell us the Python version. In
Python 2 str.encode(encoding) is basically
str.decode("ascii").encode(encoding)
which is why you probably got a UnicodeDecodeError in the traceback:
>>> "\xa3".encode("latin7")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/encodings/iso8859_13.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa3 in position 0:
ordinal not in range(128)
>>> "\xa3".decode("latin7")
u'\xa3'
>>> print "\xa3".decode("latin7")
£
Aside: always include the traceback in your posts -- and always read it
carefully. The fact that "latin7" is not mentioned might have given you a
hint that the problem was not what you thought it was.
> Any thoughts on a sort of generic method/means to handle any/all
> characters that might be out of range when having pulled them out of
> something like these MS access databases?
Assuming the data in Access is not broken and that you know the encoding
decode() will work.
> Another side note is for binary values that might store binary values, I
> use something like the following to generate hex-based strings that work
> alright when then inserting said same binary values into longblob fields,
> but, don't think this would really help for what are really just most
> likely badly chosen copy/pasted strings from documents, with strange
> encoding, or something:
> #sample code line for binary encoding into string output
> s_values += "0x" + str(l_data[J][I]).encode("hex").replace("\\", "\\\\") +
> ", "
I would expect that you can feed bytestrings directly into blobs, without
any preparatory step. Try it, and if you get failures show us the failing
code and the corresponding traceback.
--
https://mail.python.org/mailman/listinfo/python-list