Ciaran,
Thank you for your concern.
On Feb 22, 2006, at 00:14 , Ciaran Hamilton wrote:
Hi Dan,
I'm working with the Encode module in my Perl programs fairly extensively,
and firstly I have to thank you for maintaining this. It's such a
brilliant module, as it makes life so much easier when dealing with
encodings. Thank you!
I have a question about the MacSymbol encoding as implemented by Encode.
In my tests (and as seen at
http://search.cpan.org/src/DANKOGAI/Encode-2.14/ucm/macSymbol.ucm ), it
seems that the character at position 0xE4 in the MacSymbol charset is
converted by decode() to two Unicode characters: U+2122 and U+F87F.
You mean this line:
<U2122><UF87F> \xE4 |3 # TRADE MARK SIGN, alternate: sans serif
From "perldoc enc2xs":
    o   CHARMAP starts the character map section. Each line has a form as
        follows:

          <UXXXX> \xXX.. |0 # comment
            ^     ^      ^
            |     |      +- Fallback flag
            |     +-------- Encoded byte sequence
            +-------------- Unicode Character ID in hex

        The format is roughly the same as a header section except for the
        fallback flag: | followed by 0..3. The meaning of the possible
        values is as follows:

        |0  Round trip safe. A character decoded to Unicode encodes back
            to the same byte sequence. Most characters have this flag.
        |1  Fallback for unicode -> encoding. When seen, enc2xs adds this
            character for the encode map only.
        |2  Skip sub-char mapping should there be no code point.
        |3  Fallback for encoding -> unicode. When seen, enc2xs adds this
            character for the decode map only.
Since it is marked |3, this map is used for decode() only.
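To make the line format concrete, here is a minimal sketch that splits one CHARMAP line into its parts. The helper name parse_charmap_line and the %direction table are hypothetical and written only for illustration; they are not part of Encode or enc2xs, and the regex covers only the simple line shape shown above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Encode or enc2xs): split one CHARMAP
# line into code points, encoded bytes, fallback flag and comment.
sub parse_charmap_line {
    my ($line) = @_;
    my ($unicode, $bytes, $flag, $comment) =
        $line =~ /^((?:<U[0-9A-Fa-f]+>)+)\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|([0-3])\s*(?:#\s*(.*))?$/
        or return;    # not a CHARMAP entry in the form shown above
    return {
        codepoints => [ map { hex($_) } $unicode =~ /<U([0-9A-Fa-f]+)>/g ],
        bytes      => [ map { hex($_) } $bytes   =~ /\\x([0-9A-Fa-f]{2})/g ],
        flag       => $flag,
        comment    => defined $comment ? $comment : '',
    };
}

# Meaning of each fallback flag, paraphrased from the enc2xs text above.
my %direction = (
    0 => 'round trip (decode and encode)',
    1 => 'encode only (unicode -> encoding fallback)',
    2 => 'skip sub-char mapping',
    3 => 'decode only (encoding -> unicode fallback)',
);

my $entry = parse_charmap_line(
    '<U2122><UF87F> \xE4 |3 # TRADE MARK SIGN, alternate: sans serif');

printf "bytes: %s\n",
    join ' ', map { sprintf '0x%02X', $_ } @{ $entry->{bytes} };
printf "code points: %s\n",
    join ' ', map { sprintf 'U+%04X', $_ } @{ $entry->{codepoints} };
printf "flag |%d: %s\n", $entry->{flag}, $direction{ $entry->{flag} };
```

For the 0xE4 line this reports one byte, two code points (U+2122 and U+F87F), and flag 3, which is the decode-only case.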
It seems that the second character is part of the Private Use Area, and
because of this, the second character is meaningless to me, without
knowledge of what layout was used. I would guess that it's used in MacOS
to signify something or other, but I'm developing on Windows and Linux.
Frankly, I don't know that either. All I did was convert the mapping
table that Apple supplies to the Unicode Consortium:
http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/SYMBOL.TXT
0xE4 0x2122+0xF87F # TRADE MARK SIGN, alternate: sans serif
It seems to me that it would make sense for this extra character to be
removed from the conversion output, since I assume Encode is meant to be
cross-platform compatible. Alternatively, if it's meant to be there,
please let me know as I would be interested to know what the extra
character is used for.
Since the map came from Apple, it is their responsibility to fix the
original map. Meanwhile, you can use the quick fix below:
$utf8 = decode("macSymbol", $octet);
$utf8 =~ tr/\x{F87F}//d;
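As a self-contained version of the same idea, here is a minimal sketch that wraps the two steps in one subroutine. The name decode_macsymbol_plain is hypothetical, the encoding is spelled "MacSymbol" as listed in Encode::Supported, and the sketch assumes U+F87F is the only private-use code point you want to drop.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode MacSymbol octets, then delete every
# U+F87F that the Apple-derived map appends to some characters.
sub decode_macsymbol_plain {
    my ($octets) = @_;
    my $text = decode('MacSymbol', $octets);
    $text =~ tr/\x{F87F}//d;    # same quick fix as above
    return $text;
}

# Byte 0xE4 is the sans serif trade mark sign discussed in this thread.
my $text = decode_macsymbol_plain("\xE4");

# Print the resulting code points so the effect is visible in ASCII.
print join(' ', map { sprintf 'U+%04X', ord($_) } split //, $text), "\n";
```

Going by the map line quoted earlier, only U+2122 should remain for that byte. Note that the stripped string no longer records which of the two trade mark glyphs the original text used.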
If you are not content with that, you can even make your own encoding.
Consult "perldoc enc2xs" on how to do it.
Thank you,
- Ciaran.
Thank YOU.
Dan the Encode Maintainer