On Sat, 2016-03-19 at 02:23 +0100, Laszlo Ersek wrote:
> On 03/19/16 02:15, David Woodhouse wrote:
> 
> > So we treat it as an opaque sequence of bytes on the way *in*, then
> > make assumptions on the way *out* about what it was?
> 
> On the way in, it is assumed to be UTF-8, unless the user says
> otherwise. If the user says otherwise (in i18n.commitencoding), that
> statement is captured in the commit object. Either way, the commit
> message is not converted.

Right. That is exactly the problem.

In the legacy world of mixed character sets, *every* string needs to be
explicitly labelled with its character set, and everything handling
strings needs to *either* convert its input to its desired output
charset, or at the very least pass that label along intact.
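To make that concrete, here is a minimal Python sketch of "convert or pass the label along". The LabelledText type and its convert() method are hypothetical names invented for illustration; nothing in git or edk2 looks like this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelledText:
    """Hypothetical container: raw bytes that always travel with their charset label."""
    data: bytes
    charset: str

    def convert(self, target: str) -> "LabelledText":
        # Decode using the label we were given, re-encode to the target,
        # and update the label so it stays truthful.
        text = self.data.decode(self.charset)
        return LabelledText(text.encode(target), target)


# Input arrives in latin2 and says so.
incoming = LabelledText("Łódź".encode("iso-8859-2"), "iso-8859-2")

# Option 1: convert to the charset we want to emit.
as_utf8 = incoming.convert("utf-8")

# Option 2: do not convert, but hand the label on untouched.
passed_along = incoming
```

Either option keeps the bytes and the label in agreement; the failure mode is keeping the bytes and replacing the label.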

To *assume* that your input matches your output format (as git-commit
does), and to wilfully ignore the explicit labelling on it (LC_CTYPE,
as you mention), is wrong. You are deliberately *dropping* the label on
the input bytestream, and slapping an incorrect label on it instead.
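As a rough illustration of what the wrong label does, assuming a latin2 (ISO-8859-2) input and a consumer that trusts a UTF-8 or Latin-1 label instead (the sample string is made up):

```python
# Bytes as they would come from a latin2 environment.
raw = "Łódź".encode("iso-8859-2")

# Consumer trusts an incorrect "UTF-8" label: these bytes are not valid UTF-8.
try:
    raw.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

# Consumer trusts an incorrect "Latin-1" label: no error, just wrong text.
mojibake = raw.decode("latin-1")

# Consumer honours the original label: the text survives.
correct = raw.decode("iso-8859-2")
```

The UTF-8 case at least fails loudly; the Latin-1 case silently yields different characters, which is the more dangerous outcome.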

But it happens. All the time. Even in software written by people who
ought to know better.

Because consistently labelling and converting character sets is hard. 

In fact, there's some room for debate about whether LC_CTYPE *is* the
correct label for the git-commit input. It's possible that there *is*
no right answer here, and no way for git to avoid being buggy.

If you're providing a message with -m on the command line, then sure,
LC_CTYPE is correct. But when it comes from a file? Every file can come
from a different place, and can be in its *own* character set. I bet
you have *plenty* of text files on your system which aren't in latin2.
And if you're applying a patch with 'git-am' then the temporary file
storing the extracted message is almost *certainly* in UTF-8 or the
original charset from the email, not your LC_CTYPE. Right?

In the legacy charset world we'd need a way to label every text file
with its character set — LC_CTYPE is only an approximation.

But hey, thank $DEITY we solved that in the 21st century by just
standardising on UTF-8 and letting the labelling issue be a thing of
the distant past. :)

Few people fully comprehend the true horror of the "simple" system they
try to cling to...

-- 
dwmw2


_______________________________________________
edk2-devel mailing list
edk2-devel@lists.01.org
https://lists.01.org/mailman/listinfo/edk2-devel
