Stephen J. Turnbull wrote:
> Ben Franksen writes:
>  > Over the last years, unicode has established itself world-wide and
>  > firmly and is well supported by all the major operating systems. This
>  > is why I vote for dropping support for older 8-bit encodings that are
>  > not unicode compatible, thereby allowing e.g. Chinese users to use
>  > Darcs with their native languages.
> 
> Does "just dropping 8-bit support" actually enable that, or does it
> only work in a .UTF8 locale?  

I am sorry, I should have replied to myself earlier:

Contrary to what I said, it is absolutely not necessary to drop support for 
non-Unicode encodings -- in principle. All we need to require is a lossless 
conversion from the text (which we assume is encoded in the current locale) 
to Unicode on input, and the reverse on output.
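To make the requirement concrete, here is a minimal sketch (not Darcs code; the Codec type and names are made up for illustration) of the invariant: for any byte sequence the locale codec accepts, decoding to Unicode and re-encoding must reproduce the input exactly.

```haskell
import Data.Word (Word8)

-- Hypothetical codec interface for the current locale.
data Codec = Codec
  { decodeBytes :: [Word8] -> Maybe String  -- bytes -> Unicode, Nothing if invalid
  , encodeText  :: String -> Maybe [Word8]  -- Unicode -> bytes
  }

-- The invariant we need: decode followed by encode is the identity
-- on every byte sequence the codec accepts.
roundTripsLosslessly :: Codec -> [Word8] -> Bool
roundTripsLosslessly c bs =
  case decodeBytes c bs of
    Nothing -> True                      -- invalid input: nothing to preserve
    Just s  -> encodeText c s == Just bs

-- A concrete instance: Latin-1, where decoding never fails and
-- encoding fails for code points above 255.
latin1Codec :: Codec
latin1Codec = Codec
  { decodeBytes = Just . map (toEnum . fromIntegral)
  , encodeText  = traverse toByte
  }
  where
    toByte ch
      | fromEnum ch < 256 = Just (fromIntegral (fromEnum ch))
      | otherwise         = Nothing
```

Any encoding satisfying this property (UTF-8, Latin-1, Shift JIS, ...) could in principle keep working; only encodings that lose information on the round trip would be a problem.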

Ganesh explained to me why implementing this cleanly would mean a lot of 
effort. The reason is that for a long time Darcs had to work around broken 
standard IO libraries that did not consider encodings at all and simply 
assumed a 1:1 correspondence between Char and byte.

> Or does it even work at all?  I have
> trouble imagining how a random 8 bit encoding would get passed in
> verbatim to a widechar Unicode string, which can then be cast to an
> 8-bit encoding that actually comes out the way it went in.

This is exactly what Darcs currently tries to do, and what I want to get rid 
of. It breaks, of course, as soon as the text translates to code points 
outside the 8-bit range -- which western Europeans tend not to notice, since 
their characters mostly lie inside that range.
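A small illustration (again not Darcs code) of why the byte-verbatim cast is asymmetric: every byte maps to a Char, but going back only works while each code point fits in one byte, so non-Latin text has no byte to cast back to.

```haskell
import Data.Char (chr, ord)

-- The "char8" view: each input byte becomes the Char with the same
-- code point, so any byte sequence "decodes" without error.
bytesToChars :: [Int] -> String
bytesToChars = map chr

-- Casting back only works while every code point fits in one byte;
-- e.g. '\20013' (a Chinese character, U+4E2D) has no single-byte image.
charsToBytes :: String -> Maybe [Int]
charsToBytes = traverse toByte
  where
    toByte c = let n = ord c
               in if n < 256 then Just n else Nothing
```

Latin-1 text survives the round trip by accident, which is exactly why the breakage stays invisible to most western European users.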

> 8-bit
> encodings (including Latin-1) must be recoded to Unicode, or they
> probably violate the UTF-8 format

Sure. But since at least ghc-7.4 the IO libraries have been fixed and 
correctly encode text according to the current locale. So we already get 
everything the user enters properly decoded to String.

Except that Darcs, in order not to break all the code built on those 
previously necessary work-arounds, forces the IO libraries into a 
compatibility mode: it tells them the user has a "char8" encoding, which 
means "convert every byte verbatim to Char without trying to decode or 
encode".
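For anyone unfamiliar with this mode: GHC's base library exposes it directly, and a setup along these lines (a sketch of the general technique, not a quote from the Darcs source) is how a program opts out of locale-aware IO wholesale.

```haskell
import System.IO (hSetEncoding, stdin, stdout, stderr)
import GHC.IO.Encoding (char8, setLocaleEncoding)

-- Force byte-verbatim IO everywhere: the default encoding for newly
-- opened handles, plus the three standard handles, all become char8.
forceChar8 :: IO ()
forceChar8 = do
  setLocaleEncoding char8
  hSetEncoding stdin  char8
  hSetEncoding stdout char8
  hSetEncoding stderr char8
```

Once this runs, the locale-aware decoding that ghc-7.4 finally gave us is switched off again, which is the heart of the problem.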

> (eg, the sequence ASCII-characters
> latin-1-character ASCII-character can never be valid UTF-8, but it's
> extremely common in Latin-1 text).
> 
> Nor do I think you can count on command lines having a .UTF-8 locale.
> Shift JIS and to some extent EUC-JP remain popular in Japan, and at
> least my Chinese students frequently use Big5 and the GB family of
> encodings.  All of these have repertoires that are Unicode subsets,
> but the encodings are different.  Users expect to be able to "cat"
> them to the terminal and read them, and for that use case they will
> have a locale that specifies a default charset other than UTF-8.  Most
> terminals are not able to switch encodings on the fly, so this can be
> extremely inconvenient.

We are in violent agreement here.

> I'm not saying it's not worth doing, but be prepared for quite a bit
> more work than "just dropping 8-bit support."

You are most certainly right that fixing the encoding stuff in Darcs 
properly will be a lot of work.

Cheers
Ben
-- 
"Make it so they have to reboot after every typo." -- Scott Adams


_______________________________________________
darcs-users mailing list
darcs-users@darcs.net
http://lists.osuosl.org/mailman/listinfo/darcs-users