Re: confusing bullets

Sherm Pendley Sat, 10 Jan 2004 23:48:22 -0800

On Jan 10, 2004, at 9:26 PM, Vic Norton wrote:

How come Perl sees "C2 A0" whenever HexEdit sees "CA" and vice versa? I don't care what kind of characters we are talking here. To paraphrase Gertrude Stein, "a byte is a byte is a byte." At least that's what I thought until now.

Like John said - text encoding.

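The file you're viewing with HexEdit is most likely encoded using MacRoman, or possibly ISO 8859-1. Internally, Perl uses UTF-8 encoding. In MacRoman, the byte CA is a non-breaking space, and the UTF-8 encoding of that same character is the two bytes C2 A0.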
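If you want to check the CA versus C2 A0 pairing from the question without a hex editor, here is a minimal sketch. It is written in Python only because its standard codecs make this a few lines, and it assumes the file really is MacRoman. The helper name hex_bytes and the variable names are made up for illustration.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes the way a hex editor shows them, e.g. 'C2 A0'."""
    return " ".join(f"{b:02X}" for b in data)

# Assumption: the file on disk is MacRoman and contains the single byte CA.
on_disk = bytes([0xCA])

# Decoding turns the byte into a character (a non-breaking space, U+00A0).
char = on_disk.decode("mac_roman")

# Encoding that same character as UTF-8 yields two bytes.
as_utf8 = char.encode("utf-8")

print(f"MacRoman {hex_bytes(on_disk)} -> U+{ord(char):04X} -> UTF-8 {hex_bytes(as_utf8)}")
# prints: MacRoman CA -> U+00A0 -> UTF-8 C2 A0
```

If the file were ISO 8859-1 instead, the byte CA would be a different character and the UTF-8 bytes would not be C2 A0, which is a hint that MacRoman is the right guess here.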

Try this: Create a new text file in BBEdit, and enter a bullet (Option-8). Save it using the default text encoding. HexEdit shows a single byte in the file: A5. Now open the file again, and save a copy of it using UTF-8 encoding with no byte-order mark. HexEdit now shows *three* bytes: E2 80 A2. Note that you have to tell BBEdit what encoding the file uses when you open it; without the byte-order mark, BBEdit can't tell it's UTF-8.

Just for grins, save it again, this time *with* the byte-order mark. HexEdit now reports *six* bytes in the file: EF BB BF E2 80 A2.
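The same experiment can be sketched without BBEdit or HexEdit. This is only an illustration, again in Python, and it assumes the "default text encoding" in the steps above is MacRoman. The labels and the hex_bytes helper are hypothetical names, not anything BBEdit or Perl provides.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes the way a hex editor shows them, e.g. 'E2 80 A2'."""
    return " ".join(f"{b:02X}" for b in data)

bullet = "\u2022"  # the bullet character typed with Option-8

# Label -> Python codec name. "utf-8-sig" prepends the byte-order mark.
encodings = {
    "MacRoman": "mac_roman",
    "UTF-8, no BOM": "utf-8",
    "UTF-8 with BOM": "utf-8-sig",
}

# One character, three different byte sequences.
dumps = {label: hex_bytes(bullet.encode(codec)) for label, codec in encodings.items()}

for label, dump in dumps.items():
    print(f"{label:15} {dump}")
# prints:
# MacRoman        A5
# UTF-8, no BOM   E2 80 A2
# UTF-8 with BOM  EF BB BF E2 80 A2
```

The string holds one character in every case; only the number of bytes written to disk changes with the encoding.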

In other words, yes - a byte is a byte is a byte. But you're not working with bytes, you're working with text. A character is not always a byte. It can be several bytes, depending on how it's encoded.

sherm--
