Daiki Ueno <[email protected]> writes:

> Ben Pfaff <[email protected]> writes:
>
>> On Thu, Oct 09, 2014 at 06:04:02PM +0200, David Kastrup wrote:
>>> What I am actually more interested in is in having libunistring offer
>>> "roundtrippable" encodings as a fallback for decoding errors.
>>> Basically, I want an option for decoding where libunistring announces
>>> "what you have here is not valid utf-8 but I know how to deal with it".
>>> Including reencoding. And delivering unique "character codes" and
>>> string length calculations. The application would either keep track of
>>> having received "dirty utf-8" and would reencode when putting out utf-8
>>> (where reencoding "internal utf-8" to "external utf-8" means replacing
>>> the 2-byte sequences representing a wild byte by their original byte),
>>> or it would reencode into "external" utf-8 when writing anyway, which
>>> would not change anything for originally valid utf-8.
>>
>> It sounds like a reasonable philosophy to me. I don't think I'd want
>> this to become the only option for libunistring, but if there's a
>> practical way to add alternate interfaces, etc., then I think that would
>> be valuable.
>
> I don't have anything to add. I think it would be nice if Guile had
> transparent support for "raw-bytes" and UTF-8 sequences[1], but I don't
> think it is a good idea to expose internal "character codes" or the
> "internal utf-8" representation from the library interface.
>
> [1] for example, the results of decoding the external byte sequences
> "\xC2\xA0" and "\xA0" should report the same character code in the REPL,
> but they are internally distinguished and converted back to the original
> bytes when writing, like Emacs does.
In this respect I beg to differ. Carrying "invisible" information is a
recipe for security problems and/or inscrutable behavior. It would also
mean that in some use cases that produce and consume strings, this
invisible information would simply disappear.

Since a "raw byte" is not the same as a character, I see no particular
point in selling it as something belonging to the Unicode codepoint
space. The purpose of this proposal is not to make codepoint-based
processing on raw bytes particularly convenient; if that were the goal,
one would not have decoded the byte stream in the first place. And since
a fair number of random byte combinations _do_ decode under utf-8 into
proper Unicode characters, this representation would not really be much
use in that respect anyway.

Rather, the point is not to lose information by default and to be free
to choose one's fallback strategies, including transparent treatment of
binary passages in mixed-mode files or streams, without significant
performance degradation. Out-of-Unicode-proper character points seem
like they would usually work well with things like positive and negative
character ranges in regular expressions, not matching where they are not
supposed to match.

-- 
David Kastrup
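[Editorial note: the round-trip fallback under discussion is not a
libunistring interface, but Python's "surrogateescape" error handler
(PEP 383) implements essentially the same scheme and can serve as a
minimal sketch: each undecodable byte 0xNN is mapped to the lone
surrogate U+DCNN on decode, and re-encoding restores the original byte.]

```python
# Sketch of "roundtrippable" decoding of dirty UTF-8, using Python's
# "surrogateescape" error handler (PEP 383).  This is an analogy, not
# a libunistring API.

# 0xC2 0xA0 is valid UTF-8 for U+00A0; a bare 0xA0 is a "wild" byte.
raw = b"valid: \xc2\xa0 wild: \xa0"

text = raw.decode("utf-8", errors="surrogateescape")

# The valid sequence decodes to U+00A0; the wild byte becomes the
# out-of-Unicode-proper code U+DCA0, so the two stay distinguishable
# inside the string (cf. footnote [1] above):
assert text == "valid: \u00a0 wild: \udca0"

# Re-encoding with the same handler restores the exact original bytes,
# while already-valid UTF-8 passes through unchanged:
assert text.encode("utf-8", errors="surrogateescape") == raw
```

Note that the escape codes live in the surrogate range, outside proper
Unicode scalar values, which matches the argument above: they are not
meant for convenient codepoint processing, only for lossless round-trips.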
