Re: Extra Unicode character for MacSymbol 0xE4?

Dan Kogai Tue, 21 Feb 2006 07:30:37 -0800

Ciaran,

Thank you for your concern.


On Feb 22, 2006, at 00:14 , Ciaran Hamilton wrote:

Hi Dan,
I'm working with the Encode module in my Perl programs fairly extensively,
and firstly I have to thank you for maintaining this. It's such a
brilliant module as it makes life so much easier when dealing with
encodings. Thank you!
I have a question about the MacSymbol encoding as implemented by Encode.
In my tests (and as seen at
http://search.cpan.org/src/DANKOGAI/Encode-2.14/ucm/macSymbol.ucm ), it
seems that the character at position 0xE4 in the MacSymbol charset seems
to be converted by decode() to two Unicode characters - U+2122 and U+F87F.


You mean this.

<U2122><UF87F> \xE4 |3 # TRADE MARK SIGN, alternate: sans serif

perldoc enc2xs

       o   CHARMAP starts the character map section. Each line has a form as
           follows:

             <UXXXX> \xXX.. |0 # comment
               ^     ^      ^
               |     |      +- Fallback flag
               |     +-------- Encoded byte sequence
               +-------------- Unicode Character ID in hex

           The format is roughly the same as a header section except for the
           fallback flag: | followed by 0..3. The meaning of the possible
           values is as follows:

           |0  Round trip safe. A character decoded to Unicode encodes back
               to the same byte sequence. Most characters have this flag.

           |1  Fallback for unicode -> encoding. When seen, enc2xs adds this
               character for the encode map only.

           |2  Skip sub-char mapping should there be no code point.

           |3  Fallback for encoding -> unicode. When seen, enc2xs adds this
               character for the decode map only.


Since it is marked |3, this map is used for decode() only.
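
To make the flag concrete, here is a minimal sketch that splits one CHARMAP line into its parts. The helper name parse_charmap_line is hypothetical and this is not how enc2xs itself parses the file; it only follows the line format quoted from the perldoc above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of enc2xs): split one CHARMAP line into
# its Unicode code points, encoded bytes, and fallback flag.
sub parse_charmap_line {
    my ($line) = @_;
    my ($ucs, $bytes, $flag) =
        $line =~ /^\s*((?:<U[0-9A-Fa-f]+>)+)\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|([0-3])/
        or return;    # not a mapping line
    return {
        codepoints  => [ map { hex($_) } $ucs   =~ /<U([0-9A-Fa-f]+)>/g ],
        bytes       => [ map { hex($_) } $bytes =~ /\\x([0-9A-Fa-f]{2})/g ],
        flag        => $flag,
        decode_only => ($flag == 3 ? 1 : 0),    # |3: decode map only
    };
}

my $entry = parse_charmap_line(
    '<U2122><UF87F> \xE4 |3 # TRADE MARK SIGN, alternate: sans serif');
printf "bytes: %s\n", join " ", map { sprintf("0x%02X", $_) } @{ $entry->{bytes} };
printf "chars: %s\n", join " ", map { sprintf("U+%04X", $_) } @{ $entry->{codepoints} };
printf "flag:  |%d (decode only: %s)\n",
    $entry->{flag}, $entry->{decode_only} ? "yes" : "no";
```

For the 0xE4 line this reports two code points for a single byte, with the flag marking the entry as decode-only.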

It seems that the second character is part of the Private Use area, and
because of this, the second character is meaningless to me, without
knowledge of what layout was used. I would guess that it's used in MacOS
to signify something or other, but I'm developing on Windows and Linux.

Frankly, I don't know that either. All I did was convert the mapping that Apple supplies to the Unicode Consortium:


http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/SYMBOL.TXT

0xE4    0x2122+0xF87F   # TRADE MARK SIGN, alternate: sans serif

It seems to me that it would make sense for this extra character to be
removed from the conversion output, since I assume Encode is meant to be
cross-platform compatible. Alternatively, if it's meant to be there,
please let me know as I would be interested to know what the extra
character is used for.

Since the map came from Apple, it is their responsibility to fix the original map. Meanwhile, you can use the quick fix below:


$utf8 = decode("macSymbol", $octet);
$utf8 =~ tr/\x{F87F}//d;
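
As a self-contained sketch of the same quick fix, the version below wraps it in two helpers and prints the code points before and after the cleanup. The names strip_f87f and decode_mac_symbol are hypothetical, chosen only for illustration; the encoding is requested under its canonical name "MacSymbol".

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: remove the private-use character U+F87F from
# text that has already been decoded.
sub strip_f87f {
    my ($text) = @_;
    $text =~ tr/\x{F87F}//d;
    return $text;
}

# Hypothetical wrapper: decode MacSymbol octets, then apply the cleanup.
sub decode_mac_symbol {
    my ($octets) = @_;
    return strip_f87f(decode("MacSymbol", $octets));
}

# Print the code points for byte 0xE4 before and after the cleanup.
for my $text (decode("MacSymbol", "\xE4"), decode_mac_symbol("\xE4")) {
    print join(" ", map { sprintf("U+%04X", ord($_)) } split //, $text), "\n";
}
```

Note that this removes every U+F87F in the decoded string, not only the one that follows the trade mark sign.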

If you are not content with that, you can even make your own encoding. Consult "perldoc enc2xs" on how to do it.

Thank you,

 - Ciaran.


Thank YOU.

Dan the Encode Maintainer
