On 2005-04-10, Tom Lane <[EMAIL PROTECTED]> wrote:
> The impression I get is that most of the 'Unicode characters above
> 0x10000' reports we've seen did not come from people who actually needed
> more-than-16-bit Unicode codepoints, but from people who had screwed up
> their encoding settings and were trying to tell the backend that Latin1
> was Unicode or some such.  So I'm a bit worried that extending the
> backend support to full 32-bit Unicode will do more to mask encoding
> mistakes than it will do to create needed functionality.
I think you will find that this impression is actually false, or at the
very least that _correct_ verification of UTF-8 sequences will still catch
essentially all cases of non-UTF-8 input mislabelled as UTF-8, while
allowing the full range of Unicode codepoints.  (The current check will
report the "characters above 0x10000" error even on input which is
blatantly not UTF-8 at all.)  A rough sketch of the kind of strict check I
have in mind is appended below.

One of UTF-8's nicest properties is that other encodings are almost never
also valid UTF-8.  I did some tests on this myself some years ago, feeding
hundreds of thousands of short non-UTF-8 strings (taken from Usenet
subject lines in non-English-speaking hierarchies) into a UTF-8 decoder.
The false accept rate was on the order of 0.01%, and going back and
re-checking my old data, _none_ of the incorrectly accepted strings would
have been interpreted as containing characters over 0xFFFF.

-- 
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services
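For illustration only, here is a minimal sketch in C of what I mean by
strict verification.  It is not taken from the PostgreSQL source, and the
function name utf8_string_is_valid is my own invention.  The point is that
you can accept the full codepoint range up to U+10FFFF, 4-byte sequences
included, while still rejecting overlong encodings, UTF-16 surrogates,
bare continuation bytes, and truncated sequences, which is exactly what
trips up Latin1 or Windows-1252 text mislabelled as UTF-8:

/*
 * Sketch of a strict RFC 3629 UTF-8 validity check.
 * Returns true only if s[0..len-1] is well-formed UTF-8.
 */
#include <stdbool.h>
#include <stddef.h>

bool
utf8_string_is_valid(const unsigned char *s, size_t len)
{
	size_t		i = 0;

	while (i < len)
	{
		unsigned char	b = s[i];
		unsigned int	cp;
		unsigned int	min;
		int				extra;
		int				j;

		if (b < 0x80)				/* plain ASCII */
		{
			i++;
			continue;
		}
		else if ((b & 0xE0) == 0xC0)	/* 2-byte lead */
		{
			extra = 1;
			cp = b & 0x1F;
			min = 0x80;
		}
		else if ((b & 0xF0) == 0xE0)	/* 3-byte lead */
		{
			extra = 2;
			cp = b & 0x0F;
			min = 0x800;
		}
		else if ((b & 0xF8) == 0xF0)	/* 4-byte lead */
		{
			extra = 3;
			cp = b & 0x07;
			min = 0x10000;
		}
		else
			return false;		/* 0x80-0xBF or 0xF8-0xFF can't start a sequence */

		if (i + extra >= len)
			return false;		/* truncated sequence at end of string */

		for (j = 1; j <= extra; j++)
		{
			unsigned char	c = s[i + j];

			if ((c & 0xC0) != 0x80)
				return false;	/* continuation bytes must be 10xxxxxx */
			cp = (cp << 6) | (c & 0x3F);
		}

		if (cp < min)
			return false;		/* overlong encoding */
		if (cp >= 0xD800 && cp <= 0xDFFF)
			return false;		/* UTF-16 surrogate, not a character */
		if (cp > 0x10FFFF)
			return false;		/* beyond the Unicode range */

		i += extra + 1;
	}
	return true;
}

Feed Latin1 text with any accented characters through a check like this
and it is rejected almost immediately, because a byte in 0xC0-0xF7 is
hardly ever followed by the continuation bytes the sequence requires.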