On Sat, 2016-03-19 at 02:23 +0100, Laszlo Ersek wrote:
> On 03/19/16 02:15, David Woodhouse wrote:
>
> > So we treat it as an opaque sequence of bytes on the way *in*, then
> > make assumptions on the way *out* about what it was?
>
> On the way in, it is assumed to be UTF-8, unless the user says
> otherwise. If the user says otherwise (in i18n.commitencoding), that
> statement is captured in the commit object. Either way, the commit
> message is not converted.

Right. That is exactly the problem.

In the legacy world of mixed character sets, *every* string needs to be explicitly labelled with its character set, and everything handling strings needs to *either* convert its input to its desired output charset, or at the very least pass that label along intact.

To *assume* that your input matches your output format (as git-commit does), and to wilfully ignore the explicit labelling on it (LC_CTYPE, as you mention), is wrong. You are deliberately *dropping* the label on the input bytestream, and slapping an incorrect label on it instead.

But it happens. All the time. Even in software written by people who ought to know better. Because consistently labelling and converting character sets is hard.

In fact, there's some room for debate about whether LC_CTYPE *is* the correct label for the git-commit input. It's possible that there *is* no right answer here, and no way for git to avoid being buggy.

If you're providing a message with -m on the command line, then sure, LC_CTYPE is correct. But when it comes from a file? Every file can come from a different place, and can be in its *own* character set. I bet you have *plenty* of text files on your system which aren't in latin2. And if you're applying a patch with 'git-am' then the temporary file storing the extracted message is almost *certainly* in UTF-8 or the original charset from the email, not your LC_CTYPE. Right?

In the legacy charset world we'd need a way to label every text file with its character set; LC_CTYPE is only an approximation.

But hey, thank $DEITY we solved that in the 21st century by just standardising on UTF-8 and letting the labelling issue be a thing of the distant past. :)

Few people fully comprehend the true horror of the "simple" system they try to cling to...

--
dwmw2
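
For illustration only, the "drop the label, slap on a wrong one" failure can be sketched in a few lines of Python. This is a toy model of the behaviour described in this thread, not git's actual code: the function names `store_without_conversion` and `store_with_conversion` and the sample message are made up for the example, and it assumes the input really was typed in a latin2 (ISO-8859-2) environment.

```python
# Toy model (NOT git's implementation): what happens when input bytes are
# stored untouched under an assumed label, versus converted on the way in.

def store_without_conversion(raw: bytes, assumed_label: str = "utf-8"):
    """Keep the bytes as-is and attach the assumed label (hypothetical helper)."""
    return raw, assumed_label

def store_with_conversion(raw: bytes, input_label: str, stored_label: str = "utf-8"):
    """Honour the input label and convert to the storage charset (hypothetical helper)."""
    return raw.decode(input_label).encode(stored_label), stored_label

message = "László"
typed = message.encode("iso-8859-2")  # bytes as a latin2 environment would produce them

# Path 1: the real label is dropped and "utf-8" is attached instead.
payload, label = store_without_conversion(typed)
try:
    payload.decode(label)
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True
shown_wrong = payload.decode(label, errors="replace")  # what a lenient reader displays

# Path 2: the input label is honoured and the bytes are converted.
payload, label = store_with_conversion(typed, "iso-8859-2")
shown_right = payload.decode(label)

print(strict_decode_failed, repr(shown_wrong), repr(shown_right))
```

The second path only works because the caller knew the correct input label, which is exactly the piece of information that LC_CTYPE can merely approximate when the message comes from a file.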
_______________________________________________ edk2-devel mailing list edk2-devel@lists.01.org https://lists.01.org/mailman/listinfo/edk2-devel