On 1/24/11 9:01 AM, Pascal J. Bourguignon wrote:
> Matthew Mondor <mm_li...@pulsar-zone.net> writes:
>
>> On Sun, 23 Jan 2011 23:15:42 -0500
>> Matthew Mondor <mm_li...@pulsar-zone.net> wrote:
>>
>>> On Sun, 23 Jan 2011 22:52:36 -0500
>>> Matthew Mondor <mm_li...@pulsar-zone.net> wrote:
>>>
>>>> With ECL, the invalid sequence is already consumed when a more generic
>>>> error occurs (I forgot which, but could check my CVS logs on request),
>>>> which only allowed me to either ignore that invalid sequence or to
>>>> substitute it with an invalid Unicode character (0x241a or 0xfffd). If
>>>> the output must remain unmodified, there is no other way than to use
>>>> bytes or 8-bit clean characters at the moment. At least a year ago or
>>>> more, we discussed this situation a bit on this list, yet I've not
>>>> looked into it again since. Perhaps this would be a good time to
>>>> resume this work.
>>> Attached is an example CUSTOM-READ-LINE showing the difference.
>> I guess that another possibility (which could offer a less complex and
>> more efficient interface than the SBCL way) would be for ECL to
>> automatically and transparently return invalid UTF-8 sequence octets as
>> remapped in an unassigned range, and also at UTF-8 output transparently
>> output characters of that range back as the literal octets,
>> while still letting the application potentially deal with that range of
>> "characters" as it wants, as long as it's documented...
>>
>> Is anyone aware of the steps taken by other implementations than ECL or
>> SBCL when dealing with invalid UTF-8 sequences? Perhaps another
>> implementation already does this, in which case ECL wouldn't be totally
>> unique if it chose that solution.
> Clisp has a slot in its encoding structure to deal with input errors
> (and also another for output errors). You can specify to ignore the
> wrong codes, to substitute them by a (unique) character, or to signal an
> error.
> The condition contains all the information, but in private
> slots, and there's only a restart to continue reading the input, not to
> give a substitute.
>
> So the lesson would be:
>
> 1- give a public API to get access to the condition attributes.
> 2- provide featureful restarts.
>
> http://clisp.sourceforge.net/impnotes/encoding.html#make-encoding
>
> CL-USER> (with-open-file (text "/tmp/test.data"
>                           :external-format (ext:make-encoding
>                                             :charset charset:utf-8
>                                             :input-error-action :error))
>            (read-line text)
>            (read-line text))

FWIW, CMUCL has something similar:

(with-open-file (s "/tmp/test.data" :external-format :utf8 :decoding-error t)
  (read-line s))

Error in function "DEFUN MAKE-FD-STREAM":
   Invalid utf8 octet #x74 at offset 1
   [Condition of type SIMPLE-ERROR]

Restarts:
 0: [CONTINUE] Use Unicode replacement character instead
 1: [RETRY] Retry SLIME REPL evaluation request.
 2: [*ABORT] Return to SLIME's top level.
 3: [ABORT] Return to Top-Level.

For :decoding-error, you can also specify a function to handle the errors
in the way that you want.

Ray

_______________________________________________
Ecls-list mailing list
Ecls-list@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ecls-list
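
For what it's worth, the two lessons quoted above (public access to the
condition's attributes, and a restart that accepts a substitute) can be
sketched in portable Common Lisp, together with the idea of remapping
invalid octets into a reserved range. This is only an illustration under
stated assumptions: DECODING-ERROR, DECODE-OCTET, DECODE-OCTETS,
DECODE-REMAPPING-INVALID and *REMAP-BASE* are hypothetical names, not part
of ECL, CLISP, CMUCL or SBCL; the decoder accepts ASCII only, as a stand-in
for a real UTF-8 decoder; the base #xF700 is an arbitrary private-use
choice; and the implementation is assumed to support Unicode characters.

```lisp
;; Hypothetical sketch: none of these names belong to any implementation.

(defparameter *remap-base* #xF700
  "Arbitrary base: an invalid octet N is remapped to code *REMAP-BASE* + N.")

(define-condition decoding-error (error)
  ;; Public readers expose the attributes (lesson 1).
  ((octet  :initarg :octet  :reader decoding-error-octet)
   (offset :initarg :offset :reader decoding-error-offset))
  (:report (lambda (condition stream)
             (format stream "Invalid octet #x~2,'0X at offset ~D"
                     (decoding-error-octet condition)
                     (decoding-error-offset condition)))))

(defun decode-octet (octet offset)
  "Decode one octet. Only ASCII is accepted, standing in for a real decoder."
  (restart-case
      (if (< octet 128)
          (code-char octet)
          (error 'decoding-error :octet octet :offset offset))
    ;; A restart that takes a substitute from the handler (lesson 2).
    (use-value (replacement)
      :report "Supply a substitute character."
      replacement)
    (continue ()
      :report "Use the Unicode replacement character instead."
      (code-char #xFFFD))))

(defun decode-octets (octets)
  "Decode a vector of octets into a string, one octet at a time."
  (with-output-to-string (out)
    (loop for octet across octets
          for offset from 0
          do (write-char (decode-octet octet offset) out))))

(defun decode-remapping-invalid (octets)
  "Decode OCTETS, remapping each invalid octet instead of dropping it."
  (handler-bind ((decoding-error
                   (lambda (condition)
                     ;; Read the offending octet through the public reader
                     ;; and hand a substitute to the USE-VALUE restart.
                     (use-value (code-char (+ *remap-base*
                                              (decoding-error-octet condition)))
                                condition))))
    (decode-octets octets)))
```

With these definitions, (decode-remapping-invalid (vector 72 105 255))
returns a three-character string: "Hi" followed by the character with code
#xF7FF. An output routine could recover the original octet by subtracting
*REMAP-BASE* from any character in that range.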