Re: UTF-8 decoding error for characters U+10000 and above (hopefully fixed already)

Joe Wells Sun, 12 Feb 2006 18:56:07 -0800

Kenichi Handa <[EMAIL PROTECTED]> writes:

> In article <[EMAIL PROTECTED]>, Joe Wells <[EMAIL PROTECTED]> writes:
>
>> I'm using the Gentoo ebuild app-editors/emacs-22.0.50_pre20050225
>> which is based on a CVS snapshot from last year.
>
>> Try evaluating this:
>
>>   (let ((unicode-char-hex-string
>>          (format "%x"
>>                  (encode-char
>>                   (aref (decode-coding-string
>>                          ;; UTF-8 for U+1D161 (MUSICAL SYMBOL SIXTEENTH NOTE):
>>                          "\355\205\241"
>>                          'utf-8) 0)
>>                   'ucs))))
>>     (if (equal "d161" unicode-char-hex-string)
>>         (error "Oh no!  Emacs dropped 17th bit when decoding the character!")))
>
> That version of Emacs supports only BMP as written in the
> documentation of utf-8 coding system.


Yes, but it should handle the character in the same way as any other
character outside of its range.  There is this comment in utf-8.el:

  ;; We compose the untranslatable sequences into a single character,
  ;; and move point to the next character.
  ;; This is infelicitous for editing, because there's currently no
  ;; mechanism for treating compositions as atomic, but is OK for
  ;; display.  They are composed to U+FFFD with help-echo which
  ;; indicates the unicodes they represent.  ...

In my case, this did not seem to be happening.  Instead, Emacs appeared
to be translating the sequence to the wrong character.

However, I have since discovered the real problem.  I was editing the
file /usr/lib/X11/locale/en_US.UTF-8/Compose and it has a line that
reads like this:

----------------------------------------------------------------------
<Multi_key> <U1d15f> <U1d16f>   : "텡" U1D161 # MUSICAL SYMBOL SIXTEENTH NOTE
----------------------------------------------------------------------

Although the line claims that the character in the quotes is U+1D161,
the character actually stored there is U+D161, encoded in UTF-8 as
ED 85 A1.  The correct UTF-8 encoding of U+1D161 would be F0 9D 85 A1.
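
For anyone who wants to check those byte sequences without relying on
the Emacs build in question, here is a minimal sketch in Python.  The
helper name utf8_encode is my own invention for illustration; it does
the bit-packing by hand and makes no attempt to reject surrogates or
out-of-range values.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 by hand (illustrative helper only;
    does not reject surrogates or values above U+10FFFF)."""
    if cp < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx (the rest of the BMP)
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (U+10000 and above)
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# The character the Compose line claims to contain:
print(utf8_encode(0x1D161).hex())   # f09d85a1
# The character that is actually in the file:
print(utf8_encode(0xD161).hex())    # ed85a1

# Cross-check against Python's built-in codec.
assert utf8_encode(0x1D161) == chr(0x1D161).encode("utf-8")
assert b"\xed\x85\xa1".decode("utf-8") == chr(0xD161)
```

The two encodings share their last two bytes (85 A1).  The three-byte
form has no room for the 17th bit, which is why the mis-encoded file
looked like Emacs had dropped it.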

Sorry for the false alarm!  The bug is in the xorg-x11 distribution
on my machine.  I was wrong to believe this file was correct.

-- 
Joe

> u -- utf-8 (alias of mule-utf-8)
>
> UTF-8 encoding for Emacs-supported Unicode characters.
> It supports Unicode characters of these ranges:
>     U+0000..U+33FF, U+E000..U+FFFF.
> They correspond to these Emacs character sets:
>     ascii, latin-iso8859-1, mule-unicode-0100-24ff,
>     mule-unicode-2500-33ff, mule-unicode-e000-ffff
> [...]
>
> ---
> Kenichi Handa
> [EMAIL PROTECTED]


_______________________________________________
emacs-pretest-bug mailing list
http://lists.gnu.org/mailman/listinfo/emacs-pretest-bug
