The copyright symbol is not one of the characters for which there are two
One thing that can confuse people about Unicode is the distinction between the
“code point” and the representation of the code point in the various Unicode
transformation formats such as UTF-8, UTF-16, UTF-32 and so on.
The copyright symbol has code point A9 (represented in hexadecimal) in both
ISO-Latin-1 and Unicode, more commonly written with some leading zeros, e.g.
U+00A9. But when A9 is represented in UTF-8 the actual sequence of bytes in
memory or in a file is C2 followed by A9. In UTF-16 and UTF-32 you will see an
A9 and enough zero bytes to pad to 2 or 4 bytes respectively, but there you
will have the complication that the bytes may be in big-endian or little-endian
order, e.g. A9 00 00 00 for little-endian UTF-32, or 00 00 00 A9 for big-endian.
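If it helps to see the bytes directly, here is a minimal Python sketch of the above. It assumes Python 3 and its standard codec names; the helper name hex_bytes and the table variable are only for illustration and are not taken from anyone's actual code.

```python
# Illustrative sketch: one code point, U+00A9, under several encodings.
symbol = "\u00a9"  # the copyright symbol


def hex_bytes(text, encoding):
    """Return the encoded bytes of text as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


# The explicit -le / -be codec names fix the byte order and omit any BOM.
encodings = ["latin-1", "utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]
table = {name: hex_bytes(symbol, name) for name in encodings}

for name, value in table.items():
    print(f"{name:10} {value}")
```

The code point is the same in every row; only the byte sequence changes. Note that the plain "utf-16" and "utf-32" codecs in Python also prepend a byte order mark, which is why the sketch uses the explicit little-endian and big-endian variants.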
I always find the pages at http://www.fileformat.info useful for reference.
From: Shelley Doljack [mailto:sdolj...@stanford.edu]
Sent: 13 November 2015 22:30
To: Highsmith, Anne L; firstname.lastname@example.org
Subject: RE: Opening & writing to UTF-8 files; copyright symbol again --
Hey, that’s my post! Anyways, I haven’t really looked into what your problem
is, but when you said that the copyright character is getting transformed to A9
even though it is supposedly stored as C2 A9 in the database, it made me think
of how there can be two UTF-8 representations for the same character in some
sections of the Unicode set. I wonder if that is somehow happening for you.