On Mon, Jan 24, 2011 at 11:52 AM, Matthew Mondor <mm_li...@pulsar-zone.net> wrote:
> I guess that another possibility (which could offer a less complex and
> more efficient interface than the SBCL way) would be for ECL to
> automatically and transparently return invalid UTF-8 sequence octets as
> remapped in an unassigned range, and also at UTF-8 output transparently
> output those back characters of that range to the literal octets,
> while still letting the application potentially deal with that range of
> "characters" as it wants, as long as it's documented...

This sounds very much like the "UTF-8B" encoding frequently proposed to deal with these problems, although I'm unable to locate a formal specification of it on the web. Offhand, I'm not aware of any implementation supporting UTF-8B as an external format (although Babel can convert to/from it). This is unfortunate, as it would be the preferred default external format for many uses (terminal IO, directory listings, reading text files...) were it available. The essential property of any such scheme is that decoding and subsequent re-encoding of any random binary data be an identity function.

On the other hand, I've elsewhere taken to abusing Latin-1 to store arbitrarily encoded text. Plain old (unsigned-byte 8) vectors would be better still (and often more compact, due to certain implementation choices regarding BASE-CHARs), but aren't convenient to mix with character output on existing CL streams. In these applications, I'm not certain I'd prefer UTF-8B even if it were available: the decoding/encoding effort would be redundant, and I wouldn't tolerate any measurable performance degradation when there's no user-visible benefit. It's unfortunate that manipulating text in its encoded form is incompatible with CL string semantics (for some encodings, anyway).
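To make the round-trip property described above concrete, here is a minimal sketch. It is in Python rather than Lisp, purely because Python's "surrogateescape" error handler (PEP 383) implements the same idea as UTF-8B: each undecodable byte 0xNN decodes to the lone surrogate U+DCNN and is encoded back to the original byte. This is not ECL or Babel code, and the helper names are made up for illustration. The Latin-1 trick is shown alongside for comparison.

```python
def roundtrip_utf8b(data: bytes) -> bytes:
    """Decode with UTF-8B-style escaping, then re-encode (hypothetical helper)."""
    text = data.decode("utf-8", errors="surrogateescape")
    return text.encode("utf-8", errors="surrogateescape")


def roundtrip_latin1(data: bytes) -> bytes:
    """Store arbitrary bytes in a string via Latin-1 (hypothetical helper)."""
    # Latin-1 maps every byte 0x00..0xFF to the code point of the same value,
    # so no byte can ever be invalid.
    return data.decode("latin-1").encode("latin-1")


# Valid UTF-8 ("caf" + U+00E9) followed by three bytes that are not valid UTF-8.
raw = b"caf\xc3\xa9 \xff\xfe\x80"

# A strict decoder rejects the input outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# The escaping decoder keeps the valid part as real characters and remaps
# each invalid byte into the range U+DC80..U+DCFF.
text = raw.decode("utf-8", errors="surrogateescape")

assert not strict_ok
assert roundtrip_utf8b(raw) == raw    # identity on mixed valid/invalid input
assert roundtrip_latin1(raw) == raw   # identity, but U+00E9 is not decoded as such
```

The difference between the two is what the application sees in between: with the escaping scheme the valid portions are real characters, whereas with Latin-1 every string is just the bytes in disguise.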
A modern text processing library for CL might prefer operating on streams (wrapping raw encoded data) and concatenations of streams, extending the sequence functions (as supported by SBCL) to manipulate them, but that's drifting off topic.

For plain UTF-8 external formats, it sounds like ECL should support the condition/restart approach that other systems provide for handling invalid byte sequences, but this is a very low-level solution and personally I'd hope to never find myself using it. Most real-world data shouldn't be expected to satisfy a strict UTF-8 decoder, and it'd be unfortunate if every program were forced to improvise its own solution to the problem.