On 10/19/10 1:57 PM, Henning Hraban Ramm wrote:
> On 2010-10-19, at 20:00, Paul McNett wrote:
>>
>> I don't understand why this wouldn't have decoded from UTF-8, but
>> would decode from
>> latin-1. I thought UTF-8 was a superset of latin-1.
>
> You want to read
> http://www.joelonsoftware.com/articles/Unicode.html
> and
> http://www.stereoplex.com/blog/python-unicode-and-unicodedecodeerror
> and
> http://docs.python.org/library/functions.html#unicode
>
> Knowing Unicode is essential, no excuses. (Even if you're probably a
> far better programmer than me in every other regard.)
Sorry, I misremembered and didn't check facts before I posted that. Both
latin-1 and utf-8 are supersets of ascii, but utf-8 isn't a superset of
latin-1. Anyway, that confusion was due to my being on the wrong record
originally: the user had put all-ascii chars into that field, but characters
from some unknown encoding into the same field of the next record, and the
gist of the field value seemed to be identical. I thought something in our
stack somewhere was taking the uni-char "½" and converting it into the ascii
"1/2", which was the basis of my confusion.

I think I grasp the high-level concepts. Thanks for those links; I'll review
them again. But in the case of Dabo it is very hard for me to wrestle down
all the I/O interfaces at which we are supposed to encode/decode. And I never
remember whether we want decode() or encode(), because they both seem like
code to me. But I'll get it.
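For my own notes, in Python 3 terms (Python 2's unicode/str pair behaves
analogously): decode() turns bytes into text at input boundaries, encode()
turns text back into bytes at output boundaries, and the byte values for "½"
also show concretely why utf-8 isn't a superset of latin-1. A minimal sketch:

```python
# decode(): bytes -> text; use at input boundaries (files, sockets, DB rows).
raw = b'\xc2\xbd'                 # the UTF-8 bytes for the character '½'
text = raw.decode('utf-8')        # now a text (unicode) string: '½'

# encode(): text -> bytes; use at output boundaries (writing, sending, saving).
assert text.encode('utf-8') == raw

# Why utf-8 is not a superset of latin-1: the same character has
# different byte representations in the two encodings.
assert '½'.encode('latin-1') == b'\xbd'       # one byte in latin-1
assert '½'.encode('utf-8') == b'\xc2\xbd'     # two bytes in utf-8
```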
>
> This can't work:
>> try:
>>     _records = self.fetchall()
>> except Exception, e:
>>     _records = dabo.db.dDataSet()
>>     # Database errors need to be decoded from database encoding.
>>     try:
>>         errMsg = ustr(e).decode(self.Encoding)
>>     except UnicodeError:
>>         errMsg = unicode(e)
>
>
> If self.Encoding is wrong (thus UnicodeDecodeError), unicode(e) won't
> help:
>
> If you don't give an encoding in "unicode(string, encoding, errors)",
> 7bit ASCII is assumed, that's why you get:
>> <type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode...
Thanks for pointing that out. I've fixed that line to call ustr(e), which
will see that it is an exception object and try various encodings before
bailing. But as you say, that isn't ideal either: it's likely still the wrong
encoding, but it decodes fine because the given code points exist.

I don't understand why dabo.lib.getEncodings(), which dabo.lib.ustr() uses to
determine the list of encodings to try, puts the most likely one ("utf-8") at
the end of the list.
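I'd expect strict utf-8 to be tried first, precisely because an 8-bit codec
like latin-1 accepts any byte sequence and so masks everything tried after
it. A hypothetical sketch of that ordering (decode_with_fallback is my own
name for illustration, not Dabo's ustr):

```python
def decode_with_fallback(raw, encodings=('utf-8', 'latin-1')):
    """Return (text, encoding_used). Strict utf-8 must come before
    latin-1: latin-1 maps every byte 0x00-0xff to a code point, so it
    never raises and would shadow any codec listed after it."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Last resort: replace undecodable bytes instead of crashing.
    return raw.decode('utf-8', errors='replace'), 'utf-8 (replaced)'

text, enc = decode_with_fallback(b'6 \xc2\xbd\xe2\x80\x9d')  # valid utf-8
text2, enc2 = decode_with_fallback(b'6 \xbd')                # not valid utf-8
```

With utf-8 first, valid utf-8 input is recognized as such (enc is 'utf-8'),
while the lone 0xbd byte falls through to latin-1; with latin-1 first, both
would be (mis)read as latin-1.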
So now the app no longer crashes on that data field, but the child records
for that order don't open; instead I get this console output:
{{{
2010-10-19 14:18:11 - ERROR - Error fetching records: Could not decode to UTF-8 column 'special_instructions' with text 'Route for Crank 6 ½” to 12” x 2 ½” high x ½” deep, Handle 13 ½” to 15” x 1” wide x ½” deep'
}}}
Now at least dCursorMixin.execute() seems to be working correctly, but
somewhere on saving to the database (which expects utf-8) we are failing to
decode from whatever encoding the text was in and to encode into utf-8 for
saving. Otherwise, why were non-utf-8 chars saved in the first place?

Incidentally, Jeff's suggestion of setting the sqlite connection's
con.text_factory = str makes the requery() work without error, with only a
warning that it fell back to latin-1 because utf-8 didn't work. I need to
study more of our code to work out what the difference is, but because of
this I'm wondering about always setting text_factory to str, since it seems
to do the right thing for me.
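For reference, a minimal Python 3 sketch of what text_factory changes (in
Python 3 the raw-bytes factory is bytes; Python 2's str plays the same role).
With the default factory the driver decodes TEXT columns as utf-8 itself and
raises on bad data; with a bytes factory it hands the undecoded bytes to the
application, which can then apply its own fallback:

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE t (s TEXT)')
con.execute('INSERT INTO t VALUES (?)', ('6 \u00bd\u201d',))   # '6 ½”'

# Default text_factory: TEXT comes back already decoded as utf-8; invalid
# utf-8 in the column would raise an error here instead.
decoded = con.execute('SELECT s FROM t').fetchone()[0]

# Raw-bytes factory: the driver returns undecoded bytes and leaves the
# decode -- including any fallback to latin-1 -- up to the application.
con.text_factory = bytes
raw = con.execute('SELECT s FROM t').fetchone()[0]
assert raw.decode('utf-8') == decoded
```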
Anyway, other fish to fry currently.
Paul
>
>
> So, if you don't know the correct encoding, you could try to guess -
> decoding with an 8-bit encoding like latin-1 shouldn't give an error,
> since every byte is valid (even if nonsense).
> Setting "errors" to "ignore" or "replace" could help, too. (See Python
> docs, link above.)
>
> (I didn't look up what Dabo's ustr does.)
>
> In the opposite case, if you need plain ASCII, e.g. for a filename,
> you could use something like:
>
> def force_ascii(u):
>     return unicodedata.normalize('NFKD', u.lower()).encode('ASCII', 'ignore')
>
> This first normalizes to decomposed unicode (separate base letters and
> accents) and then throws away anything that isn't 7-bit ASCII, so you get
> 'a' from 'á'.
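A Python 3 rendering of that snippet, for reference (in Python 3, encode()
returns bytes, so a final .decode('ascii') is added to get a str back; the
behaviour is otherwise the same):

```python
import unicodedata

def force_ascii(u):
    # NFKD splits 'á' into 'a' plus a combining acute accent; encoding to
    # ASCII with errors='ignore' then drops the combining accent.
    decomposed = unicodedata.normalize('NFKD', u.lower())
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(force_ascii('Séance Café'))   # prints: seance cafe
```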
>
>
> Greetlings from Lake Constance!
> Hraban
> ---
> http://www.fiee.net
> https://www.cacert.org (I'm an assurer)
[excessive quoting removed by server]
_______________________________________________
Post Messages to: [email protected]
Subscription Maintenance: http://leafe.com/mailman/listinfo/dabo-dev
Searchable Archives: http://leafe.com/archives/search/dabo-dev
This message: http://leafe.com/archives/byMID/[email protected]