On 10/19/10 1:57 PM, Henning Hraban Ramm wrote:
> Am 2010-10-19 um 20:00 schrieb Paul McNett:
>>
>> I don't understand why this wouldn't have decoded from UTF-8, but
>> would decode from
>> latin-1. I thought UTF-8 was a superset of latin-1.
>
> You want to read
> http://www.joelonsoftware.com/articles/Unicode.html
> and
> http://www.stereoplex.com/blog/python-unicode-and-unicodedecodeerror
> and
> http://docs.python.org/library/functions.html#unicode
>
> Knowing Unicode is essential, no excuses. (Even if you're probably a
> far better programmer than me in every other regard.)

Sorry, I misremembered and didn't check the facts before I posted that. Both 
latin-1 and utf-8 are supersets of ASCII, but utf-8 isn't a superset of latin-1. 
Anyway, that confusion was due to my originally looking at the wrong record: the 
user had put all-ASCII characters into that field, but characters from some 
unknown encoding into the same field of the next record, and the gist of the two 
field values looked identical. I thought something in our stack somewhere was 
taking the unicode char "½" and converting it into the ASCII "1/2", which was 
the basis of my confusion.
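For what it's worth, that kind of conversion is exactly what Unicode 
compatibility normalization does, so a normalization step somewhere in a 
pipeline could plausibly account for it. A minimal Python 3 sketch (the 
normalization call is standard library; whether anything in our stack actually 
does this is my speculation):

```python
import unicodedata

# NFKC compatibility normalization decomposes "½" (U+00BD) into three
# characters: "1", U+2044 FRACTION SLASH, and "2" -- which renders almost
# exactly like the ASCII "1/2".
half = "\u00bd"
normalized = unicodedata.normalize("NFKC", half)
print(normalized)                          # 1⁄2 (U+2044, not an ASCII slash)
print([hex(ord(c)) for c in normalized])   # ['0x31', '0x2044', '0x32']
```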

I think I grasp the high-level concepts. Thanks for those links; I'll review 
them again. But in the case of Dabo it is very hard for me to pin down all the 
I/O interfaces at which we are supposed to encode/decode. And I never remember 
whether we want decode() or encode(), because they both seem like "code" to me. 
But I'll get it.
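The mnemonic that finally stuck for me (stated in modern Python 3 terms, where 
the types make it explicit): decode() goes from bytes to text, encode() goes 
from text to bytes.

```python
# decode(): bytes -> str (text).  encode(): str -> bytes.
raw = b"6 \xc2\xbd\xe2\x80\x9d"     # UTF-8 bytes arriving at an I/O boundary
text = raw.decode("utf-8")          # bytes -> text: '6 ½”'
back = text.encode("utf-8")         # text -> bytes again
assert back == raw                  # a clean round trip
```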

>
> This can't work:
>>   369     try:
>>   370       _records = self.fetchall()
>>   371     except Exception, e:
>>   372       _records = dabo.db.dDataSet()
>>   373       # Database errors need to be decoded from database encoding.
>>   374       try:
>>   375         errMsg = ustr(e).decode(self.Encoding)
>>   376       except UnicodeError:
>>   377         errMsg = unicode(e)
>
>
> If self.Encoding is wrong (thus UnicodeDecodeError), unicode(e) won't
> help:
>
> If you don't give an encoding in "unicode(string, encoding, errors)",
> 7bit ASCII is assumed, that's why you get:
>> <type 'exceptions.UnicodeDecodeError'>: 'ascii' codec can't decode...

Thanks for pointing that out. I've fixed that line to call ustr(e), which will 
see that it is an exception object and try various encodings before bailing. But 
as you say, that isn't ideal either: the encoding is likely still wrong, and the 
decode only succeeds because the given code points happen to exist in it.
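That failure mode is easy to demonstrate: an 8-bit codec like latin-1 maps every 
byte to some code point, so decoding with the wrong encoding never raises, it 
just silently produces mojibake. A quick Python 3 illustration:

```python
raw = "\u00bd".encode("utf-8")     # b'\xc2\xbd' -- the UTF-8 bytes for "½"
wrong = raw.decode("latin-1")      # decodes "fine", but yields 'Â½'
right = raw.decode("utf-8")        # '½' -- the intended text
print(wrong, right)
```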

I don't understand why dabo.lib.getEncodings(), which dabo.lib.ustr() uses to 
determine the list of encodings to try, puts the most likely one ("utf-8") at 
the end of the list.
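Putting utf-8 first would be the safer order anyway: utf-8 is strict about byte 
sequences, so it fails fast on data that isn't utf-8, while a permissive 8-bit 
codec like latin-1 accepts any byte and therefore has to come last. A 
hypothetical fallback decoder in that spirit (the function name and the exact 
encoding list are my own illustration, not what getEncodings() actually 
returns):

```python
def decode_with_fallbacks(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Try each encoding in turn; permissive 8-bit codecs go last,
    because they accept any byte sequence and so always "succeed"."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Only reachable if the list omitted a permissive codec entirely.
    return raw.decode("latin-1", errors="replace")

print(decode_with_fallbacks("\u00bd".encode("utf-8")))   # ½ -- utf-8 wins first
```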

So now the app no longer crashes on that data field; however, I don't get the 
child records opening for that order, and instead get this console output:

{{{
2010-10-19 14:18:11 - ERROR - Error fetching records: Could not decode to UTF-8 column 'special_instructions' with text 'Route for Crank 6 ½” to 12” x 2 ½” high x ½” deep, Handle 13 ½” to 15” x 1” wide x ½” deep'
}}}

Now dCursorMixin.execute() at least seems to be working correctly, but somewhere 
on saving to the database (which expects utf-8) we are failing to decode from 
whatever encoding the text was in and to encode into utf-8 before writing. 
Otherwise, why were non-utf-8 chars saved at all?

Incidentally, Jeff's suggestion of setting the sqlite connection's 
con.text_factory = str makes the requery() work without error, with only a 
warning that it fell back to latin-1 because utf-8 didn't work. I need to study 
more of our code to work out what the difference is, but because of this I'm 
wondering about always setting text_factory to str, since it seems to do the 
right thing for me.
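For reference, sqlite3's text_factory is exactly the hook for this: it receives 
the raw bytes of each TEXT column and returns whatever object you choose. (In 
Python 2, text_factory = str skipped the implicit unicode conversion entirely, 
which is presumably why the requery stopped erroring.) A fallback factory in 
that spirit, sketched in modern Python 3, where the "Could not decode to UTF-8" 
error from the log above is easy to reproduce:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# CAST(x'BD' AS TEXT) yields a TEXT value whose raw byte 0xBD is not valid
# utf-8 -- analogous to the bad 'special_instructions' data in the log.
try:
    con.execute("SELECT CAST(x'BD' AS TEXT)").fetchone()
except sqlite3.OperationalError as e:
    print("default factory:", e)     # Could not decode to UTF-8 column ...

def lenient_text(raw):
    """Decode TEXT columns as utf-8, falling back to latin-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")

con.text_factory = lenient_text
value = con.execute("SELECT CAST(x'BD' AS TEXT)").fetchone()[0]
print(value)                         # ½ -- recovered via the latin-1 fallback
```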

Anyway, other fish to fry currently.
Paul

>
>
> So, if you don't know the correct encoding, you could try to guess -
> decoding with an 8-bit encoding like latin-1 shouldn't give an error,
> since every byte is valid (even if nonsense).
> Setting "errors" to "ignore" or "replace" could help, too. (See Python
> docs, link above.)
>
> (I didn't look up what Dabo's ustr does.)
>
> In the opposite case, if you need plain ASCII, e.g. for a filename,
> you could use something like:
>
> def force_ascii(u):
>     return unicodedata.normalize('NFKD', u.lower()).encode('ASCII', 'ignore')
>
> This first normalizes to decomposed unicode (separate base letters and
> accents) and later throws any non-7bit-ASCII away, so you get 'a' from
> 'á'.
>
>
> Greetlings from Lake Constance!
> Hraban
> ---
> http://www.fiee.net
> https://www.cacert.org (I'm an assurer)

_______________________________________________
Post Messages to: [email protected]
Subscription Maintenance: http://leafe.com/mailman/listinfo/dabo-dev
Searchable Archives: http://leafe.com/archives/search/dabo-dev
This message: http://leafe.com/archives/byMID/[email protected]
