Re: Text encoding Babel. Was Re: George Keremedjiev

Grant Taylor via cctalk Sun, 25 Nov 2018 21:52:30 -0800

Not to beat a dead horse, but I ran across "Â Â Â " in a text file when read via a web browser this evening and wanted to share my findings as they seemed timely.


On 11/22/18 5:55 PM, Guy Dunphy via cctalk wrote:

Anyway, I was wondering how Ed's emails (and sometimes others elsewhere) acquired that odd corruption.


IMHO it's not corruption as much as it is incompatibility.

Answer: Ed's email util … interpret the user typing space twice in succession, as meaning "I really, really want there to be a space here, no matter what." So it inserts a 'no-break space' unicode character, which of course requires a 2-byte UTF-8 encoding.

What I'm not sure of is how the 0xC2 0xA0 translates to 0xC3 0x82, which is the Â character (U+00C2) in UTF-8.

I think that the 0xC2 0xA0 pair is treated as two independent characters. Thus 0xC2 is "Â", and 0xA0 is a non-breaking space.

I don't know what happens to the non-breaking space, but the Â and the space (0x20) that follows 0xC2 0xA0 (the three-byte sequence being 0xC2 0xA0 0x20) are included and become "Â ", which is what we see in reply text. (Encoded as 0xC3 0x82 0x20.)

So, arguably, improperly processed / translated text that results in 0xC3 0x82 0x20 / "Â " should have been a non-breaking space followed by a space.
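
The byte-level story above can be reproduced in a few lines of Python. This is only a sketch under one assumption: that the misbehaving software decoded the UTF-8 bytes as ISO-8859-1 (Latin-1), one character per byte, and then saved the result as UTF-8 again. I have not confirmed which software or code page was actually involved. In this sketch the non-breaking space survives the round trip; real software may drop or replace it.

```python
# Assumption: the reader wrongly treats UTF-8 bytes as Latin-1 (one char per byte).

# U+00A0 NO-BREAK SPACE followed by an ordinary ASCII space.
original = "\u00a0 "
wire_bytes = original.encode("utf-8")       # b'\xc2\xa0 '

# Misinterpretation: each byte becomes its own character.
misread = wire_bytes.decode("latin-1")      # 'Â', NO-BREAK SPACE, ' '

# Writing the misread text back out as UTF-8 makes the damage permanent.
reencoded = misread.encode("utf-8")         # b'\xc3\x82\xc2\xa0 '

print(wire_bytes.hex(" "))                          # c2 a0 20
print([f"U+{ord(c):04X}" for c in misread])         # ['U+00C2', 'U+00A0', 'U+0020']
print(reencoded.hex(" "))                           # c3 82 c2 a0 20
```

Under that assumption, the visible "Â" is the byte pair 0xC3 0x82, and the original non-breaking space is still present right after it as 0xC2 0xA0.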

This jibes with both Ed's email and the document that I was reading that prompted this email.

Then adds a plain ASCII space 0x20 just to be sure.

I don't think it's adding a plain ASCII space 0x20 just to be sure. Looking at the source of the message, I see "=C2=A0 ", which is the UTF-8 representation of the non-breaking space followed by the space. My MUA that understands UTF-8 shows that "=C2=A0 " translates to " ". Further, "=C2=A0 =C2=A0" translates to " ".
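
For anyone who wants to check the quoted-printable decoding by hand, here is a small sketch using Python's standard `quopri` and `unicodedata` modules. The input string is just the fragment quoted above, not the full message.

```python
import quopri
import unicodedata

# The fragment as it appears in the raw message source.
encoded = b"=C2=A0 =C2=A0"

# Undo the quoted-printable transfer encoding, then decode the bytes as UTF-8.
raw = quopri.decodestring(encoded)
text = raw.decode("utf-8")

# Name each resulting character so the invisible ones can be told apart.
names = [unicodedata.name(ch) for ch in text]

print(raw.hex(" "))   # c2 a0 20 c2 a0
print(names)          # ['NO-BREAK SPACE', 'SPACE', 'NO-BREAK SPACE']
```

So the fragment is three characters wide: non-breaking space, space, non-breaking space.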

Some of the reading that I did indicates that many things, HTML included, use white space compaction (by default), which means that multiple white space characters are reduced to a single white space character. So, when Ed wants multiple white spaces, his MUA has to do something to state that two consecutive spaces can't be compacted. Hence the non-breaking space.

=C2=A0 quite literally translates to a space character that can't be compacted. Thus "=C2=A0 =C2=A0" is really "<space> <space>" or " ".

Multiple successive spaces will need to be a mixture of space and non-breaking space characters.

So, the plain ASCII space 0x20 after (or before) =C2=A0 is not there just to be sure.
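
To make the compaction argument concrete, here is a rough Python sketch. `collapse_whitespace` is a simplified stand-in for HTML-style white space collapsing (it only folds runs of ASCII space, tab, newline, carriage return and form feed), and `protect_spaces` is a hypothetical helper showing one way an MUA could alternate non-breaking and ordinary spaces. Neither is taken from any real MUA or browser.

```python
import re

NBSP = "\u00a0"

# Only ASCII white space collapses; U+00A0 is deliberately not in this class.
# (Python's \s would also match U+00A0 in str patterns, so it is not used here.)
_COLLAPSIBLE = re.compile(r"[ \t\n\r\f]+")


def collapse_whitespace(text: str) -> str:
    """Fold every run of collapsible white space into a single space."""
    return _COLLAPSIBLE.sub(" ", text)


def protect_spaces(text: str) -> str:
    """Rewrite runs of 2+ spaces as alternating NBSP / space (hypothetical helper)."""
    def alternate(match: re.Match) -> str:
        return "".join(NBSP if i % 2 == 0 else " " for i in range(len(match.group())))
    return re.sub(r" {2,}", alternate, text)


plain = "a   b"
print(repr(collapse_whitespace(plain)))                  # 'a b'
print(repr(collapse_whitespace(protect_spaces(plain))))  # 'a\xa0 \xa0b'
```

With the alternating pattern no two ordinary spaces are ever adjacent, so there is nothing left for the collapsing step to fold.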

Personally I find it more interesting than annoying. Just another example of the gradual chaotic devolution of ASCII, into a Babel of incompatible encodings. Not that ASCII was all that great in the first place.

As stated in another reply, I don't think ASCII was ever trying to be the Babel fish. (Thank you Douglas Adams.)

Takeaway: Ed, one space is enough. I don't know how you got the idea people might miss seeing a single space, and so you need to type two or more.

I wondered if it wasn't a typo or keyboard sensitivity issue. I remember I had to really slow down the double click speed for my grandpa (R.I.P.) so that he could use the mouse. Maybe some users actuate keys slowly enough that the computer thinks that it's repeated keys. ¯\_(ツ)_/¯

But it isn't so. The normal convention in plain text is one space character between each word.

The operative word is "convention", as in commonly accepted but not always the case behavior. ;-)

And since plain ASCII is hard-formatted, extra spaces are NOT ignored and make for wider spacing between words.

It seems as if you made an assumption. Just because the underlying character set is ASCII (per RFC 821 & 822, et al) does not mean that the data being carried is also ASCII, as is evident from the Content-Type: header stating a character set of UTF-8.
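
A small sketch of that point, using Python's standard `email` package: every byte of the message below is 7-bit ASCII, yet the headers declare that the payload is UTF-8 wrapped in quoted-printable. The message itself is made up for illustration (the address and body text are placeholders, not taken from Ed's email).

```python
from email import message_from_string, policy

# Hypothetical message: pure ASCII on the wire, UTF-8 content once decoded.
raw = (
    "From: sender@example.invalid\n"
    "Content-Type: text/plain; charset=UTF-8\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "one=C2=A0 =C2=A0two\n"
)

msg = message_from_string(raw, policy=policy.default)

# The declared character set of the body, as read from the Content-Type header.
charset = msg.get_content_charset()

# get_content() undoes the transfer encoding and decodes with that charset.
body = msg.get_content()

print(raw.isascii())   # True
print(charset)         # utf-8
print(repr(body))
```

The transport sees nothing but ASCII; it is the Content-Type and Content-Transfer-Encoding headers that tell the receiving MUA how to turn those bytes back into non-breaking spaces.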

Especially when textual white space compression does exactly that, ignore extra white spaces.

Which looks very odd, even if your mail utility didn't try to do something 'special' with your unusual user input.


I frequently use multiple spaces with ASCII diagrams.

+------+
| This |
|  is  |
|   a  |
|  box |
+------+

That will not look like I intended it with white space compression.

Btw, I changed the subject line, because this is a wider topic. I've been meaning to start a conversation about the original evolution of ASCII, and various extensions. Related to a side project of mine.


I'm curious to know more about your side project.



--
Grant. . . .
unix || die
