Re: [Ecls-list] UTF-8 sequence decoding errors [Was: Upcoming changes]

Matthew Mondor Sat, 12 Feb 2011 17:02:36 -0800

On Sat, 12 Feb 2011 23:49:14 +0100
Juan Jose Garcia-Ripoll <juanjose.garciarip...@googlemail.com> wrote:


> I did not realize that from your previous email. This is fixed now (trivial
> typo in utf_8_decoder)

I tested it, and the conversion of invalid UTF-8 bytes to LATIN-1 now
works fine.

I also ran a test related to my earlier suggestion about preserving
invalid input intact at output, which Andy Hefner later referred to as
"UTF-8B", and it seems possible.

The remaining problem with UTF-8B is that it requires support in the
UTF-8 encoder, because those bytes should be output as-is.  Moreover,
this may break things if the UTF-8 decoder does not transparently
support this input conversion, because the implementation would
otherwise not be able to read what it can write.  I got bitten by a
stuck slime several times when decoding/encoding errors occurred.  When
I was not careful and wrote a stream with invalid characters in UTF-8
mode (i.e. printed characters in the #xDCxx range without passing them
through the latin-1 conversion function), it seems that slime could not
catch the decoding error :)

But UTF-8B could in fact be considered another encoding, and if I
really need it I might eventually send patches to make it optionally
available.  The advantages of this mode were mentioned on this list in
the earlier "Unicode: uncomfortable situation" thread.  The attempt is
attached nevertheless.  It works fine except for literal output.

Thanks a lot.  To me it seems that ECL is now on par with SBCL for
UTF-8 input handling.
-- 
Matt

(defun custom-read-line (stream &key (max 512) (utf-8b nil))
  "Reads a line from STREAM.  Uses #\\Newline as the line terminator, and
strips any trailing #\\Return or #\\Newline characters.
Will read up to MAX characters (defaults to 512).
By default, invalid UTF-8 sequences encountered will be imported as
LATIN-1 characters, unless UTF-8B is true, in which case invalid octets
will be imported as-is in a special Unicode range suitable for later
literal output."
  (let ((line (make-array max
                          :element-type 'character
                          :adjustable t
                          :fill-pointer 0)))
    (macrolet ((add-char (c)
                 `(vector-push ,c line))
               (add-utf-8b-octet (o)
                 `(vector-push (code-char (logior #xDC00 (logand ,o #xFF)))
                               line)))
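      (flet ((finalize-line ()
               ;; Strip trailing line terminators, guarding against an
               ;; empty line (VECTOR-POP would signal an error on one).
               (loop
                  while (and (plusp (fill-pointer line))
                             (member (aref line (1- (fill-pointer line)))
                                     '(#\Return #\Newline)))
                  do (vector-pop line))
               line))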
        (loop
           do
             (let (
                   ;; No way to determine invalid octet values with old ECL,
                   ;; Return an unknown character code
                   (c #+old-ecl(handler-case
                                   (read-char stream)
                                 (simple-error ()
                                   #\UFFFD))
                      ;; SBCL provides invalid octets which we can import and
                      ;; then issue an ATTEMPT-RESYNC restart to resume
                      #+sbcl(handler-bind
                                ((sb-int:stream-decoding-error
                                  #'(lambda (e)
                                      ;; Treat invalid UTF-8 octets as
                                      ;; ISO-8859 characters.
                                      (mapcar #'(lambda (c)
                                                  (when (> c 127)
                                                    (add-char (code-char c))))
                                              (sb-int:character-decoding-error-octets
                                               e))
                                      (invoke-restart 'sb-int:attempt-resync))))
                              (read-char stream))
                      ;; Test with new ECL
                      #+ecl(handler-bind
                               ((ext:stream-decoding-error
                                 #'(lambda (e)
                                     (mapcar #'(lambda (o)
                                                 (if (and utf-8b (> o 127))
                                                     (add-utf-8b-octet o)
                                                     (add-char
                                                      (code-char o))))
                                             (ext:character-decoding-error-octets
                                              e))
                                     (invoke-restart 'continue))))
                             (read-char stream))))
               (when (char= #\Newline c)
                 (return (values (finalize-line) t)))
               (add-char c)))))))

(defun display-utf8b-latin1 (string)
  "Formats a string which may contain UTF-8B encoded invalid sequence
octets for screen display, treating those as ISO-8859/LATIN-1 characters.
Note that if using this function to output back to a file, the original
stream would be altered.  The UTF-8 encoder must output these bytes
literally in the output stream to preserve the original octets."
  (map 'string #'(lambda (c)
                   (let ((n (char-code c)))
                     (if (>= #xDCFF n #xDC80)
                         (code-char (logand n #xFF))
                         c)))
       string))
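
;;; The literal output step that DISPLAY-UTF8B-LATIN1 cannot provide could
;;; look roughly like the sketch below.  UTF-8B-ENCODE is a hypothetical
;;; helper: it is not part of ECL or SBCL and was not in the original
;;; attachment.  It assumes invalid octets were imported as characters in
;;; the #xDC80-#xDCFF range (as ADD-UTF-8B-OCTET does above) and writes
;;; them back unchanged, encoding every other character as ordinary UTF-8.
;;; Other lone surrogates are not treated specially here.
(defun utf-8b-encode (string)
  "Returns a vector of octets for STRING.  Characters in the
#xDC80-#xDCFF range are emitted as the single original octet; all
other characters are encoded as regular UTF-8."
  (let ((octets (make-array (length string)
                            :element-type '(unsigned-byte 8)
                            :adjustable t
                            :fill-pointer 0)))
    (flet ((emit (o)
             (vector-push-extend o octets)))
      (loop
         for c across string
         for n = (char-code c)
         do (cond ((<= #xDC80 n #xDCFF)
                   ;; Preserved invalid octet: output it as-is.
                   (emit (logand n #xFF)))
                  ((< n #x80)
                   (emit n))
                  ((< n #x800)
                   (emit (logior #xC0 (ash n -6)))
                   (emit (logior #x80 (logand n #x3F))))
                  ((< n #x10000)
                   (emit (logior #xE0 (ash n -12)))
                   (emit (logior #x80 (logand (ash n -6) #x3F)))
                   (emit (logior #x80 (logand n #x3F))))
                  (t
                   (emit (logior #xF0 (ash n -18)))
                   (emit (logior #x80 (logand (ash n -12) #x3F)))
                   (emit (logior #x80 (logand (ash n -6) #x3F)))
                   (emit (logior #x80 (logand n #x3F)))))))
    octets))
;;; The resulting octets could then be written with WRITE-SEQUENCE to a
;;; stream opened with :ELEMENT-TYPE '(UNSIGNED-BYTE 8), bypassing the
;;; implementation's own UTF-8 encoder.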

(defun test ()
  (with-open-file (stream "/tmp/InvalidUTF8.txt")
    (loop
       do
         (let ((line (handler-case
                         (custom-read-line stream :utf-8b t)
                       (end-of-file ()
                         (loop-finish)))))
           (format t "~A~%" (display-utf8b-latin1 line))))))

Ceci est une phrase Ã©crite en FranÃ§ais utilisant l'encodage UTF-8.
Ceci est une phrase écrite en Français utilisant l'encodage ISO-8859-15.


