On Tue, Oct 2, 2018 at 5:21 AM Tom Fredrik Blenning <b...@blenning.no> wrote:

>
> On 02/10/2018 02:28, William A Rowe Jr wrote:
> >
> > Very concerned about trusting utf-8 for anything that is expected to be
> > console readable. Not as much the xml/html decorated contents, but the
> > man pages and any text files concern me. I've been living in a utf-8
> > console for a very long time, but I don't expect my experience is
> > typical.
>
> I would say that this is the usual experience. I've been trying to find
> data on this with no luck, but I would suggest that, with the possible
> exception of the English-only world, an overwhelming majority uses
> UTF-8. If you only use ASCII, as in English, UTF-8 and ISO-8859-1
> become the same.
>

It would be great to have that data.
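
For the narrow point about ASCII, that part is easy to demonstrate; a
quick sanity check (Python, purely illustrative):

    # Pure ASCII encodes to the same bytes in UTF-8 and in ISO-8859-1,
    # so English-only documentation is unaffected by the switch.
    s = "Apache HTTP Server documentation"
    assert s.encode("utf-8") == s.encode("iso-8859-1")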


> > I'm also not keen on delivering 1%-3% more bytes over the wire where we
> > can directly represent the contents in an ISO-8859 representation.
>
> Again, if you're using ASCII only, there is no such overhead. Delivering
> entities is a lot more expensive, so if the change is significant at all,
> I expect bandwidth usage to drop rather than increase.
>

Adding to that thought, we are just now making the jump from &-encoded
entity forms, so any growth in file size relative to ISO-8859-1 is still
smaller than the historical entity representations.
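
To make the size comparison concrete, a sketch in Python (the arrow is
just an example character, not taken from the actual docs):

    # An entity reference costs more bytes than the UTF-8 character it
    # replaces, e.g. for RIGHTWARDS ARROW (U+2192):
    entity = "&rarr;"
    literal = "\u2192"
    print(len(entity.encode("utf-8")))   # 6 bytes
    print(len(literal.encode("utf-8")))  # 3 bytes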


> I'm a bit out on a limb here, but AFAICR, these symbols are represented
> with a 4-byte width and occur in predictable code-point ranges. It should
> be easy to set up a build test that fails if they are used. If anything,
> they should be avoided, because even though fairly widely used, they are
> a relatively new introduction. Things may break even though UTF-8 support
> is in place.
>

The Arrows, Miscellaneous Technical, and Block Elements character blocks
all fall into the 3-byte UTF-8 encoding range. I agree that the more
modern emoji etc. are more problematic, for several reasons.
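
Those blocks all sit below U+FFFF, so a build test along the lines Tom
suggests could simply reject anything that needs 4 bytes in UTF-8. A
minimal sketch (Python; the script and its cut-off are assumptions, not
part of our current build):

    import sys

    def check_bmp_only(path):
        """Fail if a file contains code points above U+FFFF, i.e.
        characters that need 4 bytes in UTF-8, such as most emoji."""
        ok = True
        with open(path, encoding="utf-8") as f:
            for lineno, line in enumerate(f, 1):
                for ch in line:
                    if ord(ch) > 0xFFFF:
                        print(f"{path}:{lineno}: non-BMP char U+{ord(ch):04X}")
                        ok = False
        return ok

    if __name__ == "__main__":
        results = [check_bmp_only(p) for p in sys.argv[1:]]
        sys.exit(0 if all(results) else 1)

Arrows (U+2190-21FF), Miscellaneous Technical (U+2300-23FF) and Block
Elements (U+2580-259F) would all pass such a check; emoji would not.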
