On 02.08.2017 10:59, Ben RUBSON wrote:

On 02 Aug 2017, at 10:52, André Warnier (tomcat) <a...@ice-sa.com> wrote:

On 01.08.2017 19:30, Ben RUBSON wrote:
Hi,

The following UTF-8 :
warn("warn with special char ééèè");
$r->log->error("log with special char ééèè");

Produces :
warn with special char ééèè at ...
[Tue Aug 01 19:25:28.914947 2017] [perl:error] [pid 56938] [client 127.0.0.1:59952] log with special char \xc3\xa9\xc3\xa9\xc3\xa8\xc3\xa8

Why all these \x symbols ?

These represent the *bytes* which correspond to the UTF-8 encoding of your "special" characters 
above. E.g. the character "é" has the Unicode codepoint 233 (decimal) or E9 (hexadecimal). When 
encoded using the UTF-8 encoding, this is represented by the 2 bytes C3 A9 (hexadecimal). The "\x" 
prefix is a common way to indicate that the two characters which follow are hexadecimal digits 
giving the value of one byte.
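
To check this arithmetic yourself, here is a minimal sketch using the core Encode module. The helper name utf8_hex_bytes is made up for this illustration; it is not part of Encode or mod_perl.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: returns the UTF-8 bytes of a string as hex pairs.
sub utf8_hex_bytes {
    my ($text) = @_;
    my $bytes = encode('UTF-8', $text);     # characters -> bytes
    return join ' ', map { sprintf '%02X', $_ } unpack('C*', $bytes);
}

my $char = "\x{E9}";                        # "é", written as a codepoint escape
printf "codepoint: %d decimal, %X hexadecimal\n", ord($char), ord($char);
printf "UTF-8 bytes: %s\n", utf8_hex_bytes($char);
```

This should print codepoint 233 / E9 and the bytes C3 A9, matching the first \xc3\xa9 pair in the log line.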

The exact reason why $r->log->error chooses to represent these characters in 
such a way in the logfile (instead of just printing them as the bytes that constitute 
their UTF-8 encoding) is not really known to me, but I can make a guess :

Internally, perl "knows" that these characters are Unicode.  But when it writes them out 
to a file (such as here the logfile of Apache), it does not necessarily know that this file itself 
is opened "in UTF-8 mode" and that it can just send the characters that way.
So it "escapes" them in a way that will make them readable by a human, no 
matter what (*).
And those are the \x.. (pure ASCII) representations that you see in the logfile.
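
Whichever layer actually does the escaping, the effect can be reproduced with a few lines of Perl. This is only a sketch under my own assumptions: escape_non_ascii is a made-up helper, not the code that Apache or mod_perl really runs, and it does not try to escape the backslash itself.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: NOT the actual Apache or mod_perl implementation.
# Encodes the text as UTF-8, then rewrites every byte outside printable
# ASCII as a \xHH escape (lowercase hex, as seen in the logfile).
sub escape_non_ascii {
    my ($text) = @_;
    my $bytes = encode('UTF-8', $text);
    $bytes =~ s/([^\x20-\x7E])/sprintf('\\x%02x', ord($1))/ge;
    return $bytes;
}

# "ééèè" written with codepoint escapes, so the source file encoding does not matter
print escape_non_ascii("log with special char \x{E9}\x{E9}\x{E8}\x{E8}"), "\n";
```

The printed line should end with the same \xc3\xa9\xc3\xa9\xc3\xa8\xc3\xa8 sequence as the logfile entry quoted above.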

On the other hand, the "warn()" that you also use above is perl writing 
directly to its STDERR. And because that is a file that perl opened itself, it knows that 
it can handle UTF-8, so it writes these characters directly that way.

How to avoid them ?

In this case, I don't know, because it may depend on the way that Apache 
handles its logfiles, and not only on perl/mod_perl.


(*) for example, no matter which text editor you later use to view the logfile. 
All text editors can handle ASCII, but not necessarily UTF-8.

Ah, and I just saw your follow-up message, and between that and the above, we 
should have some reasonable explanation together.

Thank you very much for your detailed answer André !
Yes, Perl must certainly escape UTF-8 characters as you just explained.
If we convert the string to ASCII first (using Encode), these special 
characters are still not correctly displayed, this time due to Apache's 
ap_escape_errorlog_item() function.

Best thing is then to avoid them :)


Unfortunately, this is not an option when applications have to deal with multiple languages, and maybe log some important data that just is "not English" (like names of people, or filenames that people use).

And unfortunately too, that is an issue which often does not seem so important to a lot of native-English-speaking programmers, who tend to consider such characters as indeed "special" and get very confused by them. To 80% of the people on earth, such characters are not "special" at all; they are an integral part of their language, just like "a" or "b" are an integral part of the English language.

Hell, I can't even write my own name correctly without them ! (And neither can a multitude of websites and email programs, still today. I still get called Andr~O or similar all the time.)



