On Wed, 2005-10-26 at 08:08 +0200, Florian Weimer wrote:
> It seems that the UTF-8 decoder treats the byte sequence EF BF BF as
> invalid. Doesn't this mean that with your changes, it is encoded as
> FFFF 00EF FFFF 00BF FFFF 00BF on the Mono side?

The UTF-8 decoder doesn't treat EF BF BF as invalid; see mcs/class/corlib/Test/System.Text/UTF8EncodingTest.cs:T5_IllegalCodePosition_3_Other_532(). Apparently .NET treats EF BF BF as the encoding of U+FFFF, which is correct, even though U+FFFF is a noncharacter that is guaranteed never to be assigned.

Consequently, EF BF BF will be decoded as U+FFFF. If U+FFFF is the last character in the managed string, it will be re-encoded as EF BF BF. If another character follows it, the encoder will assume that the following character is an escaped byte (the usual escape mechanism), so in that case the output won't match the input.

I'm hoping that this scenario is sufficiently rare that things will Just Work. If it isn't, I'll have to find a different escape character. How does U+0001 sound (control character, START OF HEADING)? Something else?

 - Jon
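
For anyone who wants to poke at the collision, here is a rough Python sketch of the escape mechanism as described above. It is not the Mono code: the function names, the constant, and the decision to escape one byte at a time and to mask the escaped character to 8 bits are all assumptions made for illustration.

```python
# Hypothetical model of the U+FFFF escape scheme; not the Mono implementation.
ESCAPE = "\uffff"  # assumed escape character


def decode_with_escapes(data: bytes) -> str:
    """Decode UTF-8, turning each undecodable byte into ESCAPE + chr(byte)."""
    out = []
    i = 0
    while i < len(data):
        try:
            out.append(data[i:].decode("utf-8"))
            break
        except UnicodeDecodeError as err:
            # err.start is relative to the slice data[i:].
            out.append(data[i:i + err.start].decode("utf-8"))
            bad = data[i + err.start]
            out.append(ESCAPE + chr(bad))
            i += err.start + 1
    return "".join(out)


def encode_with_escapes(text: str) -> bytes:
    """Encode to UTF-8, turning ESCAPE + char back into a single raw byte."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == ESCAPE and i + 1 < len(text):
            # Assumption: the character after ESCAPE is taken as a byte value.
            out.append(ord(text[i + 1]) & 0xFF)
            i += 2
        else:
            # A trailing ESCAPE has nothing to escape, so it is encoded as-is.
            out += ch.encode("utf-8")
            i += 1
    return bytes(out)


if __name__ == "__main__":
    for raw in (b"a\xffb", b"\xef\xbf\xbf", b"\xef\xbf\xbfA"):
        managed = decode_with_escapes(raw)
        print(raw, "->", ascii(managed), "->", encode_with_escapes(managed))
```

In this model an invalid byte such as FF survives the round trip, and EF BF BF survives when it ends the string, but EF BF BF followed by "A" comes back as the single byte 41, which is the mismatch described above.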
