Ciaran,

Thank you for your concern.

On Feb 22, 2006, at 00:14, Ciaran Hamilton wrote:
Hi Dan,

I'm working with the Encode module in my Perl programs fairly extensively,
and firstly I have to thank you for maintaining this. It's such a
brilliant module as it makes life so much easier when dealing with
encodings. Thank you!

I have a question about the MacSymbol encoding as implemented by Encode.
In my tests (and as seen at
http://search.cpan.org/src/DANKOGAI/Encode-2.14/ucm/macSymbol.ucm ), it seems that the character at position 0xE4 in the MacSymbol charset is converted by decode() to two Unicode characters: U+2122 and U+F87F.

You mean this line:

<U2122><UF87F> \xE4 |3 # TRADE MARK SIGN, alternate: sans serif

According to "perldoc enc2xs":

o CHARMAP starts the character map section. Each line has a form as follows:

             <UXXXX> \xXX.. |0 # comment
               ^     ^      ^
               |     |      +- Fallback flag
               |     +-------- Encoded byte sequence
               +-------------- Unicode Character ID in hex

The format is roughly the same as a header section except for the fallback flag: | followed by 0..3. The meaning of the possible values is as follows:

|0 Round trip safe. A character decoded to Unicode encodes back to the same byte sequence. Most characters have this flag.

|1 Fallback for unicode -> encoding. When seen, enc2xs adds this character for the encode map only.

|2 Skip sub-char mapping should there be no code point.

|3 Fallback for encoding -> unicode. When seen, enc2xs adds this character for the decode map only.

Since it is marked |3, this map is used for decode() only.
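
To make the line format above concrete, here is a minimal sketch that splits one CHARMAP line into its three fields. It is only an illustration: parse_charmap_line and %FLAG_MEANING are hypothetical names of mine, not part of enc2xs, and the real parser in enc2xs may well differ.

```perl
use strict;
use warnings;

# Fallback flag meanings, paraphrased from "perldoc enc2xs".
my %FLAG_MEANING = (
    0 => 'round trip safe',
    1 => 'encode map only (unicode -> encoding)',
    2 => 'skip sub-char mapping',
    3 => 'decode map only (encoding -> unicode)',
);

# Hypothetical helper, NOT part of enc2xs: parse one CHARMAP line.
# Returns a hash reference, or nothing if the line does not match.
sub parse_charmap_line {
    my ($line) = @_;
    $line =~ s/\s*#.*\z//s;    # drop the trailing comment
    my ($uni, $bytes, $flag) =
        $line =~ /^\s*(\S+)\s+(\S+)\s+\|([0-3])\s*$/
        or return;
    # One or more <UXXXX> groups, then one or more \xXX groups.
    my @code_points = map { hex($_) } $uni   =~ /<U([0-9A-Fa-f]+)>/g;
    my @octets      = map { hex($_) } $bytes =~ /\\x([0-9A-Fa-f]{2})/g;
    return {
        code_points => \@code_points,
        octets      => \@octets,
        flag        => $flag,
        meaning     => $FLAG_MEANING{$flag},
    };
}

my $entry = parse_charmap_line(
    '<U2122><UF87F> \xE4 |3 # TRADE MARK SIGN, alternate: sans serif');
printf "U+%04X ", $_ for @{ $entry->{code_points} };
printf "<- 0x%02X (%s)\n", $entry->{octets}[0], $entry->{meaning};
```

For the 0xE4 line this reports two code points, one byte, and flag 3, which is why the pair only shows up on the decode() side.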

It seems that the second character is part of the Private Use area, and
because of this, the second character is meaningless to me, without
knowledge of what layout was used. I would guess that it's used in MacOS to signify something or other, but I'm developing on Windows and Linux.

Frankly, I don't know either. All I did was convert the mapping table that Apple supplies to the Unicode Consortium:

http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/SYMBOL.TXT
0xE4    0x2122+0xF87F   # TRADE MARK SIGN, alternate: sans serif


It seems to me that it would make sense for this extra character to be
removed from the conversion output, since I assume Encode is meant to be
cross-platform compatible. Alternatively, if it's meant to be there,
please let me know as I would be interested to know what the extra
character is used for.

Since the map came from Apple, it is their responsibility to fix the original map. Meanwhile, you can use the quick fix below:

use Encode qw(decode);
$utf8 = decode("macSymbol", $octet);    # decoded character string
$utf8 =~ tr/\x{F87F}//d;                # delete every U+F87F

If you are not content with that, you can even make your own encoding. Consult
"perldoc enc2xs" on how to do it.
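
If you need the quick fix in more than one place, it can be wrapped in a small subroutine. This is only a sketch: decode_mac_symbol is a hypothetical name, not something Encode provides, and I am assuming you want U+F87F dropped wherever it appears in the decoded text. I use the canonical name "MacSymbol" here.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical wrapper, NOT part of Encode: decode MacSymbol octets
# and strip the private-use character U+F87F from the result.
sub decode_mac_symbol {
    my ($octets) = @_;
    my $text = decode('MacSymbol', $octets);
    $text =~ tr/\x{F87F}//d;    # same quick fix as above
    return $text;
}

# Byte 0xE4 is the TRADE MARK SIGN entry discussed in this thread.
my $text = decode_mac_symbol("\xE4");
printf "U+%04X\n", ord($_) for split //, $text;
```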

Thank you,

 - Ciaran.

Thank YOU.

Dan the Encode Maintainer
