Matthew Mondor <mm_li...@pulsar-zone.net> writes:

> On Sun, 23 Jan 2011 23:15:42 -0500
> Matthew Mondor <mm_li...@pulsar-zone.net> wrote:
>
>> On Sun, 23 Jan 2011 22:52:36 -0500
>> Matthew Mondor <mm_li...@pulsar-zone.net> wrote:
>>
>> > With ECL, the invalid sequence is already consumed when a more generic
>> > error occurs (I forgot which, but could check my CVS logs on request),
>> > which only allowed me to either ignore that invalid sequence or to
>> > substitute it with an invalid Unicode character (0x241a or 0xfffd).  If
>> > the output must remain unmodified, there is no other way than to use
>> > bytes or 8-bit clean characters at the moment.  At least a year ago or
>> > more, we discussed this situation a bit on this list, yet I've not
>> > looked into it again since.  Perhaps this would be a good time to
>> > resume this work.
>>
>> Attached is an example CUSTOM-READ-LINE showing the difference.
>
> I guess that another possibility (which could offer a less complex and
> more efficient interface than the SBCL way) would be for ECL to
> automatically and transparently return invalid UTF-8 sequence octets as
> remapped in an unassigned range, and also at UTF-8 output transparently
> output characters of that range back as the literal octets,
> while still letting the application potentially deal with that range of
> "characters" as it wants, as long as it's documented...
>
> Is anyone aware of the steps taken by other implementations than ECL or
> SBCL when dealing with invalid UTF-8 sequences?  Perhaps another
> implementation already does this, in which case ECL wouldn't be totally
> unique if it chose that solution.

Clisp has a slot in its encoding structure to deal with input errors
(and also another for output errors).  You can specify to ignore the
wrong codes, to substitute them by a (unique) character, or to signal
an error.

The condition contains all the information, but in private slots, and
there's only a restart to continue reading the input, not to give a
substitute.

So the lesson would be:

1- give a public API to get access to the condition attributes.
2- provide featureful restarts.

http://clisp.sourceforge.net/impnotes/encoding.html#make-encoding

CL-USER> (with-open-file (text "/tmp/test.data"
                               :external-format (ext:make-encoding
                                                 :charset charset:utf-8
                                                 :input-error-action :error))
           (read-line text)
           (read-line text))

*** - READ-LINE: Invalid byte sequence #xE9 #x74 #xE9 in CHARSET:UTF-8 conversion
The following restarts are available:
RETRY            :R1      Retry SLIME REPL evaluation request.
PROCESS-INPUT    :R2      Continue reading input.
ABORT            :R3      Return to SLIME's top level.
CLOSE-CONNECTION :R4      Close SLIME connection.
ABORT            :R5      Abort main loop
C/Break 1 USER[8]> :i
#<EXT:SIMPLE-CHARSET-TYPE-ERROR #x000334353628>:  standard object
 type: EXT:SIMPLE-CHARSET-TYPE-ERROR
0 [$DATUM]:  #<ARRAY (UNSIGNED-BYTE 8) (3) #x0003343535C8>
1 [$EXPECTED-TYPE]:  #<ENCODING CHARSET:UTF-8 :UNIX>
2 [$FORMAT-CONTROL]:  "~S: Invalid byte sequence #x~A~A #x~A~A #x~A~A in ~S conversion "
3 [$FORMAT-ARGUMENTS]:  (READ-LINE #\E #\9 #\7 #\4 #\E #\9 CHARSET:UTF-8)
INSPECT-- type :h for help; :q to return to the REPL --->

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
A bad day in () is better than a good day in {}.
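
As a rough illustration of the two lessons drawn from the CLISP
behaviour (public accessors on the condition, and restarts that can do
more than just continue), here is a minimal sketch in portable Common
Lisp.  All the names in it (DECODING-ERROR, its readers, DECODE-OCTETS,
SKIP-OCTET) are hypothetical and are not part of CLISP, ECL or SBCL.
To stay short, the decoder accepts only 7-bit ASCII and treats every
other octet as invalid, so it is a stand-in for a real UTF-8 decoder,
not an implementation of one.

```lisp
;; Hypothetical condition: the offending octets and their offset are
;; exposed through public readers instead of private slots.
(define-condition decoding-error (error)
  ((octets :initarg :octets :reader decoding-error-octets)
   (offset :initarg :offset :reader decoding-error-offset))
  (:report (lambda (condition stream)
             (format stream "Invalid octets ~S at offset ~D"
                     (decoding-error-octets condition)
                     (decoding-error-offset condition)))))

;; Simplified decoder (assumption: only octets below #x80 are valid).
;; Each invalid octet signals DECODING-ERROR with two restarts around it.
(defun decode-octets (octets)
  (with-output-to-string (out)
    (loop for octet across octets
          for offset from 0
          do (if (< octet #x80)
                 (write-char (code-char octet) out)
                 (restart-case
                     (error 'decoding-error
                            :octets (vector octet)
                            :offset offset)
                   (use-value (replacement)
                     :report "Supply a character to use instead."
                     (write-char replacement out))
                   (skip-octet ()
                     :report "Drop the invalid octet and keep decoding."
                     nil))))))

;; Handler that substitutes a character.  #\? is used for portability;
;; a Unicode-enabled build could pass (code-char #xFFFD) instead.
(defun decode-with-replacement (octets)
  (handler-bind ((decoding-error
                   (lambda (condition)
                     (use-value #\? condition))))
    (decode-octets octets)))

;; Handler that uses the public readers to log each error, then skips.
;; Returns the decoded string and a list of (offset . octets) pairs.
(defun decode-collecting-errors (octets)
  (let ((errors '()))
    (values
     (handler-bind ((decoding-error
                      (lambda (condition)
                        (push (cons (decoding-error-offset condition)
                                    (decoding-error-octets condition))
                              errors)
                        (invoke-restart 'skip-octet))))
       (decode-octets octets))
     (nreverse errors))))
```

With the octets from the transcript, (decode-with-replacement
(vector #xE9 #x74 #xE9)) substitutes both #xE9 octets and keeps the
"t", while DECODE-COLLECTING-ERRORS drops them and reports their
offsets.  The point of the design is that the policy (ignore,
substitute, record) lives in the handler, not in the decoder.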
_______________________________________________
Ecls-list mailing list
Ecls-list@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ecls-list