Not to beat a dead horse, but I ran across "Â Â Â " in a text file when read via a web browser this evening and wanted to share my findings as they seemed timely.

On 11/22/18 5:55 PM, Guy Dunphy via cctalk wrote:
> Anyway, I was wondering how Ed's emails (and sometimes others elsewhere) acquired that odd corruption.

IMHO it's not corruption as much as it is incompatibility.

> Answer: Ed's email util … interpret the user typing space twice in succession, as meaning "I really, really want there to be a space here, no matter what." So it inserts a 'no-break space' unicode character, which of course requires a 2-byte UTF-8 encoding.

What I'm not sure of is how the 0xC2 0xA0 translates to 0xC3 0x82, which is the "Â" character.

I think that the 0xC2 0xA0 pair is treated as two independent characters. Thus 0xC2 is "Â", and 0xA0 is a non-breaking space.

I don't know what happens to the non-breaking space, but the "Â" and the space (0x20) that follows 0xC2 0xA0 (the three-byte sequence being 0xC2 0xA0 0x20) are included and become "Â ", which is what we see in reply text. (Encoded as 0xC3 0x82 0x20.)

So, arguably, improperly processed / translated text that results in 0xC3 0x82 0x20 / "Â " should have been a non-breaking space followed by a space.
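For what it's worth, here is a minimal sketch of that theory. It assumes the reader misinterprets the UTF-8 bytes as ISO-8859-1 (one byte per character) and then re-encodes the result as UTF-8; I can't confirm that this is what any particular MUA or browser actually does.

```python
# Assumption: UTF-8 bytes are misread as ISO-8859-1 (Latin-1), one byte per character.
original = "\u00a0 "                     # non-breaking space, then a plain space
utf8_bytes = original.encode("utf-8")    # the bytes 0xC2 0xA0 0x20
misread = utf8_bytes.decode("latin-1")   # "Â", then U+00A0, then a plain space
reencoded = misread.encode("utf-8")      # what a reply would carry if re-encoded

print(utf8_bytes.hex(" "))   # c2 a0 20
print(reencoded.hex(" "))    # c3 82 c2 a0 20
```

Under that assumption the non-breaking space does not vanish. It survives as its own (still invisible) character between the "Â" and the plain space.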

This jibes with both Ed's email and the document that I was reading that prompted this email.

> Then adds a plain ASCII space 0x20 just to be sure.

I don't think it's adding a plain ASCII space 0x20 just to be sure. Looking at the source of the message, I see "=C2=A0 ", which is the quoted-printable form of the UTF-8 non-breaking space, followed by the space. My MUA, which understands UTF-8, shows that "=C2=A0 " translates to "  " (two columns of white space). Further, "=C2=A0 =C2=A0" translates to "   " (three columns).

Some of the reading that I did indicates that many things, HTML included, use white space compaction (by default), which means that multiple white space characters are reduced to a single white space character. So, when Ed wants multiple white spaces, his MUA has to do something to state that two consecutive spaces can't be compacted. Hence the non-breaking space.

=C2=A0 quite literally translates to a space character that can't be compacted. Thus "=C2=A0 =C2=A0" is really "<space> <space>" or "   ".
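A quick way to check this is to decode the quoted-printable text by hand. This sketch uses Python's standard quopri module on the literal string from the message source; the variable names are mine.

```python
import quopri

raw = b"=C2=A0 =C2=A0"                               # as seen in the message source
decoded = quopri.decodestring(raw).decode("utf-8")   # undo quoted-printable, then UTF-8
code_points = [f"U+{ord(ch):04X}" for ch in decoded]

print(code_points)   # ['U+00A0', 'U+0020', 'U+00A0']
```

So the three visible columns are a non-breaking space, a plain space, and another non-breaking space.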

Multiple successive spaces will need to be a mixture of space and non-breaking space characters.
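To illustrate why the mixture is needed, here is a rough sketch of white space compaction. The collapse() helper is hypothetical and only approximates what an HTML renderer does: it squeezes runs of ASCII white space and leaves U+00A0 alone.

```python
import re

def collapse(text: str) -> str:
    # Squeeze runs of ASCII white space into one space.
    # U+00A0 is deliberately not in the character class.
    return re.sub(r"[ \t\r\n\f]+", " ", text)

NBSP = "\u00a0"
plain = "a   b"                          # three ordinary spaces
mixed = "a" + NBSP + " " + NBSP + "b"    # NBSP, space, NBSP

print(len(collapse(plain)))   # 3
print(len(collapse(mixed)))   # 5
```

The run of ordinary spaces shrinks to one, while the alternating sequence keeps all three columns.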

So, the plain ASCII space 0x20 after (or before) =C2=A0 is not there just to be sure.

> Personally I find it more interesting than annoying. Just another example of the gradual chaotic devolution of ASCII, into a Babel of incompatible encodings. Not that ASCII was all that great in the first place.

As stated in another reply, I don't think ASCII was ever trying to be the Babel fish. (Thank you Douglas Adams.)

> Takeaway: Ed, one space is enough. I don't know how you got the idea people might miss seeing a single space, and so you need to type two or more.

I wondered if it wasn't a typo or keyboard sensitivity issue. I remember I had to really slow down the double click speed for my grandpa (R.I.P.) so that he could use the mouse. Maybe some users actuate keys slowly enough that the computer thinks that it's repeated keys. ¯\_(ツ)_/¯

> But it isn't so. The normal convention in plain text is one space character between each word.

The operative word is "convention", as in commonly accepted behavior that is not always followed. ;-)

> And since plain ASCII is hard-formatted, extra spaces are NOT ignored and make for wider spacing between words.

It seems as if you made an assumption. Just because the underlying character set is ASCII (per RFC 821 & 822, et al.) does not mean that the data being carried is also ASCII, as is evident from the Content-Type: header stating a character set of UTF-8.

Especially when textual white space compression does exactly that: ignore extra white space.
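As a sketch of the Content-Type: point, the following parses a made-up message (the headers and body are hypothetical, not copied from Ed's email) with Python's standard email package. The message itself is pure ASCII on the wire, yet the header says the payload is UTF-8.

```python
from email import message_from_string

# Hypothetical message: every byte here is 7-bit ASCII.
raw = (
    "Content-Type: text/plain; charset=UTF-8\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "two=C2=A0 spaces\n"
)

msg = message_from_string(raw)
charset = msg.get_content_charset()                    # taken from the Content-Type: header
body = msg.get_payload(decode=True).decode(charset)    # undo quoted-printable, then the charset

print(charset)      # utf-8
print(repr(body))
```

The decoded body contains U+00A0, which is not an ASCII character, even though nothing outside ASCII was ever transmitted.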

> Which looks very odd, even if your mail utility didn't try to do something 'special' with your unusual user input.

I frequently use multiple spaces with ASCII diagrams.

+------+
| This |
|  is  |
|   a  |
|  box |
+------+

That will not look like I intended it with white space compression.
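Here is roughly what that looks like, as a small sketch that squeezes each run of plain spaces in the box above into a single space (a simplification of what a real renderer does):

```python
import re

box = """+------+
| This |
|  is  |
|   a  |
|  box |
+------+"""

# Collapse every run of plain spaces on each line to a single space.
collapsed = [re.sub(r" +", " ", line) for line in box.splitlines()]
print("\n".join(collapsed))
```

The top and bottom edges keep their width, but the middle rows shrink to "| is |", "| a |" and "| box |", so the right-hand wall no longer lines up.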

> Btw, I changed the subject line, because this is a wider topic. I've been meaning to start a conversation about the original evolution of ASCII, and various extensions. Related to a side project of mine.

I'm curious to know more about your side project.



--
Grant. . . .
unix || die
