On Jan 4, 12:02 am, John Machin <[EMAIL PROTECTED]> wrote:
> On Jan 4, 8:03 am, mario <[EMAIL PROTECTED]> wrote:
> > On Jan 2, 2:25 pm, Piet van Oostrum <[EMAIL PROTECTED]> wrote:
>
> > > Apparently for the empty string the encoding is irrelevant as it will not
> > > be used. I guess there is an early check for this special case in the
> > > code.
>
> > In the module I an working on [*] I am remembering a failed encoding
> > to allow me, if necessary, to later re-process fewer encodings.
>
> If you were in fact doing that, you would not have had a problem. What
> you appear to have been doing is (a) remembering a NON-failing
> encoding, and assuming that it would continue not to fail

Yes, exactly. But it makes no difference which ones I remember, as the
two subsets will always add up to the same thing anyway. In this special
case (empty string!) the unicode() call does not fail...

> (b) not
> differentiating between failure reasons (codec doesn't exist, input
> not consistent with specified encoding).

There is no failure in the first pass in this case... If I do as you
suggest further down, that is, use s.decode(encoding) instead of
unicode(s, encoding) to force the lookup, then I could remember the
failure reason and use it to decide how to proceed. However, I am aiming
at an automatic decision, so an in-context error message would need to
be replaced with more rigorous information about how the guessing should
proceed. I am also trying to keep this simple ;)

<snip>

> In any case, a pointless question (IMHO); the behaviour is extremely
> unlikely to change, as the chance of breaking existing code outvotes
> any desire to clean up a minor inconsistency that is easily worked
> around.

Yes, I would agree. The workaround may not even be worth it though: what
I really want is a unicode object, so changing from calling unicode() to
s.decode() is not quite right, and will anyway require a further check.
That means less clear code, and a small unnecessary performance hit for
the 99.9% majority of cases...

Anyhow, I have further improved the "post guess" checking/refining
logic of the algorithm [*].

What I'd like to understand better is the "compatibility hierarchy" of
known encodings, in the positive sense that if a string decodes
successfully with encoding A, then it is also possible that it will
decode with encodings B, C; and in the negative sense that if a string
fails to decode with encoding A, then for sure it will also fail to
decode with encodings B, C. Any ideas if such an analysis of the
relationships between encodings exists?

Thanks!
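
To make the two points above concrete, here is a minimal sketch that
keeps the failure reasons apart and probes which encodings accept a
given string. It assumes Python 3, where the input is a bytes object
rather than the Python 2 str discussed in this thread. The names
try_decode, decodable_with, UNKNOWN_CODEC and BAD_INPUT are hypothetical
and are not part of decodeh. The explicit codecs.lookup() call is there
so that an unknown codec name is reported even for empty input.

```python
import codecs

# Hypothetical reason labels, for illustration only.
UNKNOWN_CODEC = "unknown codec"
BAD_INPUT = "input not valid in this encoding"


def try_decode(data, encoding):
    """Return (text, failure_reason); failure_reason is None on success."""
    try:
        # Look the codec up explicitly, so that a bad codec name is
        # reported even when data is empty.
        codecs.lookup(encoding)
        return data.decode(encoding), None
    except LookupError:
        return None, UNKNOWN_CODEC
    except UnicodeDecodeError:
        return None, BAD_INPUT


def decodable_with(data, candidates):
    """Return the candidate encodings that decode data without error."""
    return [enc for enc in candidates if try_decode(data, enc)[1] is None]


candidates = ["ascii", "utf-8", "latin-1"]
print(try_decode(b"", "no-such-codec"))            # unknown codec, empty input
print(decodable_with(b"plain", candidates))        # pure ASCII bytes
print(decodable_with(b"caf\xc3\xa9", candidates))  # UTF-8 encoded "café"
print(decodable_with(b"caf\xe9", candidates))      # Latin-1 encoded "café"
```

For these three encodings the relationships are fixed: anything that
decodes as ASCII also decodes as UTF-8 and Latin-1, anything that fails
as UTF-8 also fails as ASCII, and Latin-1 accepts every byte string, so
a successful Latin-1 decode says nothing about whether the guess is
right.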

mario

[*] http://gizmojo.org/code/decodeh/