> On 02 Aug 2017, at 11:17, André Warnier (tomcat) <a...@ice-sa.com> wrote:
>
> On 02.08.2017 10:59, Ben RUBSON wrote:
>>
>>> On 02 Aug 2017, at 10:52, André Warnier (tomcat) <a...@ice-sa.com> wrote:
>>>
>>> On 01.08.2017 19:30, Ben RUBSON wrote:
>>>> Hi,
>>>>
>>>> The following UTF-8 :
>>>> warn("warn with special char ééèè");
>>>> $r->log->error("log with special char ééèè");
>>>>
>>>> Produces :
>>>> warn with special char ééèè at ...
>>>> [Tue Aug 01 19:25:28.914947 2017] [perl:error] [pid 56938] [client
>>>> 127.0.0.1:59952] log with special char \xc3\xa9\xc3\xa9\xc3\xa8\xc3\xa8
>>>>
>>>> Why all these \x symbols ?
>>>
>>> These represent the *bytes* which correspond to the UTF-8 encoding of your
>>> "special" characters above. E.g. the character "é" has the Unicode
>>> codepoint 233 (decimal) or E9 (hexadecimal). When encoded using the UTF-8
>>> encoding, this is represented by 2 bytes C3 A9 (hexadecimal). The "\x"
>>> prefix is a common way to indicate that the symbols which follow should be
>>> interpreted as a hexadecimal number.
>>>
>>> The exact reason why $r->log->error chooses to represent these characters
>>> in such a way in the logfile (instead of just printing them as the bytes
>>> that constitute their UTF-8 encoding) is not really known to me, but I can
>>> make a guess :
>>>
>>> Internally, perl "knows" that these characters are Unicode. But when it
>>> writes them out to a file (such as here the logfile of Apache), it does not
>>> necessarily know that this file itself is opened "in UTF-8 mode" and that
>>> it can just send the characters that way.
>>> So it "escapes" them in a way that will make them readable by a human, no
>>> matter what (*).
>>> And those are the \x.. (pure ASCII) representations that you see in the
>>> logfile.
>>>
>>> On the other hand, the "warn()" that you also use above, that is perl
>>> writing directly to its STDERR. And because that is a file that perl opened
>>> itself, it knows that it can handle UTF-8, so it writes these characters
>>> directly that way.
>>>
>>>> How to avoid them ?
>>>
>>> In this case, I don't know, because it may depend on the way that Apache
>>> handles its logfiles, and not only on perl/mod_perl.
>>>
>>>>
>>> (*) for example, no matter which text editor you later use to view the
>>> logfile. All text editors can handle ASCII, but not necessarily UTF-8.
>>>
>>> Ah, and I just saw your follow-up message, and between that and the above,
>>> we should have some reasonable explanation together.
>>
>> Thank you very much for your detailed answer André !
>> Yes Perl must certainly escape UTF-8 characters as you just explained.
>> If we convert the string to ascii first (using Encode), these special
>> characters are not correctly displayed, this time due to Apache
>> ap_escape_errorlog_item() function.
>>
>> Best thing is then to avoid them :)
>>
>
> Unfortunately, this is not an option when applications have to deal with
> multiple languages, and maybe log some important data that just is "not
> english" (like names of people, or filenames that people use).
> And unfortunately too, that is an issue which often does not seem so
> important to a lot of english-native-language programmers, who tend to
> consider such characters as indeed "special" and get very confused by them.
> To 80% of the people on earth, such characters are not "special" at all; they
> are an integral part of their language, just like "a" or "b" are an integral
> part of the English language. Hell, I can't even write my own name correctly
> without them ! (and neither can a multitude of websites and email programs,
> still today. I still get called Andr~O or similar all the time).

Yes, you're right: this is an issue if we need to log things such as user input.
Supporting the "extended ASCII" range (that is, Latin-1 / ISO-8859-1, code points up to decimal 255) would at least help a little: we would then be able to log 'André' correctly :)
But many characters would still not be supported...
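
For anyone who wants to check the byte values discussed in this thread, here is a small sketch in Python (not Perl, and not what mod_perl or Apache actually runs). The helper names utf8_hex, escape_non_ascii and fits_latin1 are made up for this illustration, and escape_non_ascii is only a simplified stand-in for the kind of escaping seen in the error log: it keeps printable ASCII and turns every other byte into \xNN, nothing more.

```python
# Illustrative sketch only: helper names are hypothetical, and the escaping
# is a simplification, not a reimplementation of Apache's logging code.

def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as space-separated hex pairs."""
    return " ".join(f"{b:02X}" for b in text.encode("utf-8"))


def escape_non_ascii(text: str) -> str:
    """Keep printable ASCII bytes, write every other UTF-8 byte as \\xNN."""
    out = []
    for b in text.encode("utf-8"):
        if 0x20 <= b < 0x7F:
            out.append(chr(b))        # printable ASCII passes through
        else:
            out.append(f"\\x{b:02x}")  # anything else is hex-escaped
    return "".join(out)


def fits_latin1(text: str) -> bool:
    """True if every character has a code point of 255 or below."""
    return all(ord(ch) <= 0xFF for ch in text)


e_acute = "\u00e9"  # the character "é", written as an escape to avoid ambiguity

print(ord(e_acute))            # → 233
print(utf8_hex(e_acute))       # → C3 A9
print(escape_non_ascii("log with special char \u00e9\u00e9\u00e8\u00e8"))
print(fits_latin1("Andr\u00e9"))  # → True
print(fits_latin1("\u20ac"))      # → False (the euro sign is U+20AC)
```

The third print reproduces the \xc3\xa9\xc3\xa9\xc3\xa8\xc3\xa8 sequence from the log line quoted above, and the last two show why a limit of 255 would cover 'André' but not, for example, the euro sign.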