On Tue, Oct 2, 2018 at 5:21 AM Tom Fredrik Blenning <b...@blenning.no> wrote:
> On 02/10/2018 02:28, William A Rowe Jr wrote:
> > Very concerned about trusting utf-8 for anything that is expected to be
> > console readable. Not as much the xml/html decorated contents, but the
> > man pages and any text files concern me. I've been living in a utf-8
> > console for a very long time, but I don't expect my experience is typical.
>
> I would say that this is the usual experience. I've been trying to find
> data on this with no luck, but I would suggest that, with perhaps the
> exception of the English-only world, there is an overwhelming majority
> who use UTF-8. If you only use ASCII, as in English, UTF-8 and
> ISO-8859-1 become the same.

It would be great to have that data.

> > I'm also not keen on delivering 1%-3% more bytes over the wire where we
> > can directly represent the contents in an ISO-8859 representation.
>
> Again, if you're using ASCII only, there is no such overhead. Delivering
> entities is a lot more expensive. If at all significant, I expect to see
> bandwidth usage drop and not increase.

Adding to that thought, we are just making the jump from &encoded forms, so
any growth in file size is still smaller than the historical representations.

> I'm a bit out on a limb here, but AFAICR, these symbols are represented
> with 4 byte width and occur in predictable codepages. It should be easy
> to set up a build test that fails if used. If anything, this should be
> avoided, because even though fairly widely used, they are a relatively
> new introduction. Things may break even though UTF-8 support is in place.

The Arrows, Misc Technical, and Block Elements character blocks all fall
into 3-byte width encoding. I agree that more modern emoji etc. are more
problematic, for several reasons.
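
The byte-size claims above are easy to check. A minimal sketch in Python (for illustration only; the sample strings and the 4-byte check are assumptions, not part of any httpd build tooling):

```python
# Illustration of the encoding claims discussed above.

ASCII = "Apache HTTP Server"
ARROW = "\u2192"        # U+2192 RIGHTWARDS ARROW (Arrows block)
EMOJI = "\U0001F642"    # U+1F642 SLIGHTLY SMILING FACE (outside the BMP)

# Pure ASCII is byte-identical in UTF-8 and ISO-8859-1: no wire overhead.
assert ASCII.encode("utf-8") == ASCII.encode("iso-8859-1")

# Arrows, Misc Technical, and Block Elements code points encode as
# 3 bytes in UTF-8 -- still shorter than the 6-byte &rarr; entity form.
print(len(ARROW.encode("utf-8")), len("&rarr;"))   # 3 6

# Emoji sit above U+FFFF and need 4 bytes in UTF-8; a build test could
# reject any code point that would require a 4-byte sequence.
def has_4byte_utf8(text: str) -> bool:
    """True if any code point would need a 4-byte UTF-8 sequence."""
    return any(ord(ch) > 0xFFFF for ch in text)

print(has_4byte_utf8(ARROW), has_4byte_utf8(EMOJI))  # False True
```

A check like `has_4byte_utf8` run over the doc sources would catch emoji while leaving the 3-byte symbol blocks alone.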