Sorry, ASCII is an exact subset of Unicode encoded in UTF-8. There is no problem for syslog. The problem is unique to Japan. The Japanese predominantly use the family of charactersets called ISO 2022-JP by the IETF. This family includes two 7-bit charactersets in addition to the multi-byte characters for Kanji. One of these 7-bit charactersets is the same as ASCII with the sole exception of the backslash character.
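
The byte-level point above can be sketched in Python. This is only an illustration under stated assumptions: it uses the standard `ascii` and `utf-8` codecs, and `render_jis_roman` is a hypothetical helper (not part of any library) that imitates a display which shows code value 92 as a yen sign.

```python
# Sketch: ASCII text has identical bytes in ASCII and in UTF-8, so the
# encoding on the wire is unambiguous; only the rendering of 0x5C differs.

BACKSLASH = "\\"      # U+005C, ASCII code value 92
YEN_SIGN = "\u00a5"   # U+00A5, a separate Unicode character


def render_jis_roman(data: bytes) -> str:
    """Hypothetical display routine: shows byte 0x5C as a yen sign,
    imitating a terminal that assumes the Japanese 7-bit set."""
    return data.decode("ascii").replace(BACKSLASH, YEN_SIGN)


msg = r"path=C:\logs"
wire = msg.encode("utf-8")

# The same bytes come out whether the text is treated as ASCII or UTF-8.
assert wire == msg.encode("ascii")
# The backslash travels as the single byte 92.
assert 0x5C in wire
# The real Unicode yen sign is a different character with a 2-byte encoding.
assert YEN_SIGN.encode("utf-8") == b"\xc2\xa5"

# A strict ASCII/UTF-8 display shows the backslash ...
assert wire.decode("utf-8") == msg
# ... while the hypothetical Japanese display shows a yen sign instead.
assert render_jis_roman(wire) == "path=C:\u00a5logs"
```

The bytes never change in this sketch; the confusion comes entirely from how code value 92 is drawn on the screen.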
The result is that this Japanese variation is frequently called "ASCII" even though it differs in one character. It is almost always used for displaying and editing files that are supposed to be kept in ASCII. This only introduces problems when the backslash character is used. Many commonly deployed systems do not have a strictly compliant ASCII mode because this is not useful in Japan. You just accept the occasional confusion between the yen sign and the backslash.

This worked quite well until CP/M and MS-DOS started using the ASCII backslash character. Before then it was quite rare in ordinary text files where the proper Japanese backslash is encoded in the Japanese 7-bit set rather than the ASCII 7-bit set. (The Japanese 7-bit set is encoded into the range 128-255 when using ISO 2022-JP.)

The problem has gotten worse as people try to adapt to a world where ISO 2022-JP coexists with UTF-8 Unicode. Some user interfaces try to help out by doing character substitutions. This makes things worse, since the logical substitution in Japan is to use the most common Japanese form of backslash, which is not the ASCII backslash. Someday in the distant future the transition to a uniform character coding will be complete, but for the near future these transitional behaviors will exist.

I expect that some Japanese systems will use syslog without using multi-byte characters. They will stick to their "ASCII" subset from the ISO 2022-JP family, will not convert their Kanji to Unicode, and will not attempt to send Kanji as part of a syslog message. That is a very reasonable behavior. They are the people who need the reminder about the backslash.

There is another whole nest of complexity when you go into the Chinese languages. The problems there are at a more abstract level. The IETF has decided to internationalize using Unicode encoded as UTF-8. The mainland Chinese government has decided to internationalize using GB18030 and requires this by law for some uses.
The 7-bit subset of GB18030 is the same as the 7-bit ASCII and the same as the first 128 characters of Unicode encoded as UTF-8. So there will be no conflict there until someone wants to encode Hanzi, Yi, or Mongolian characters into a syslog message. The law mandates support of four languages, not just Mandarin, so you need all those characters.

After that you get into a purely legal argument about encodings. As of Unicode version 2.4 (and all subsequent versions) there is a characterset coordination between GB18030 and UTF-8 encoded Unicode so that a single lossless conversion algorithm can be used between the two characterset encodings. The various language committees have also reached agreements on who updates the Chinese language symbols. So you just need to argue that the bit encoding over the wire for syslog does not fall into one of the categories where GB18030 is required by law. I personally think it falls into one of the exceptions. I expect that most Chinese will avoid using non-ASCII characters in the message so that they can process it as though it is GB18030.

R Horn


"Rainer Gerhards" <[EMAIL PROTECTED]scon.com>
06/03/2005 12:09 PM

To: Robert Horn/WIL/AGFA/US/[EMAIL PROTECTED], <[EMAIL PROTECTED]>
cc: "Alexander Clemm (alex)" <[EMAIL PROTECTED]>, <[EMAIL PROTECTED]>, "Steve Chang (schang99)" <[EMAIL PROTECTED]>, <syslog-sec@employees.org>
Subject: RE: [Syslog-sec] Syslog protocol - UTF-8 encoding

Robert,

thanks for all the good points. Just to clarify one thing that really puzzles me: so you are saying that ASCII is actually NOT a subset of UTF-8, because in Japanese the \ character has been replaced by the Yen sign, so there are two different interpretations of this UTF-8 code?
Rainer

> -----Original Message-----
> From: Robert Horn [mailto:[EMAIL PROTECTED]
> Sent: Friday, June 03, 2005 6:04 PM
> To: [EMAIL PROTECTED]
> Cc: Alexander Clemm (alex); [EMAIL PROTECTED]; Rainer Gerhards;
> Steve Chang (schang99); syslog-sec@employees.org
> Subject: RE: [Syslog-sec] Syslog protocol - UTF-8 encoding
>
> A MUST in the header with SHOULD elsewhere would be sufficient, but I
> think that there is little risk making it a MUST everywhere. ISO made
> it into a MUST with the extensions to 10646-2. The problem is an
> oversight in the UTF-8 specification. It specifies how to take an
> m-bit character and break it down into 8-bit chunks. It was assumed
> that people would always minimize the number of 8-bit chunks used, and
> this is the general practice. So if I have a character with a 10-bit
> code point, it will get encoded as a 6-bit and a 4-bit chunk. Then
> malicious programmers discovered that they could get programs to
> malfunction by using more chunks, e.g. encoding a 10-bit code point as
> two 4-bit chunks and a 2-bit chunk. Sometimes this caused buffer
> overflows and sometimes it let them evade 8-bit oriented regular
> expression parsers. These are legitimate UTF-8 encodings because the
> UTF-8 specification failed to require minimal size encodings be used.
> I am not aware of any reasonable UTF-8 encoder that does not generate
> minimal size encodings.
>
> I didn't have the text with me while traveling, hence the uncertainty
> over the "space". Specifying the ASCII code value is sufficient. We
> probably should note to readers that the code value used for backslash
> in ASCII is used for the yen symbol in Japan, and that they should be
> prepared for user interface confusion.
> It is inevitable that there will be people who use the Japanese
> backslash character (a valid UTF-8 character) instead of the correct
> ASCII code value because they are matching what they see on the screen
> with what they see on the keyboard. We should alert them to the
> problem. (Or we could pick another character, but most of the good
> characters have already been used for other purposes.)
>
> R Horn
>
>
> "Anton Okmianski \(aokmians\)" <[EMAIL PROTECTED]om>
> 06/02/2005 03:53 PM
>
> To: Robert Horn/WIL/AGFA/US/[EMAIL PROTECTED], <[EMAIL PROTECTED]>
> cc: "Alexander Clemm \(alex\)" <[EMAIL PROTECTED]>, <[EMAIL PROTECTED]>, "Steve Chang \(schang99\)" <[EMAIL PROTECTED]>, <syslog-sec@employees.org>
> Subject: RE: [Syslog-sec] Syslog protocol - UTF-8 encoding
>
> Robert:
>
> > Potential confusions:
> >
> > 1) Saying UTF-8 is insufficient. To really cover all the
> > bases (especially from a security and string parsing
> > perspective) you need to say:
> >
> > "Unicode characters encoded in UTF-8 using the minimal
> > encoding." UTF-8 permits a variety of encodings for the same
> > character, but only one is the minimal encoding.
>
> Are you suggesting we make minimum encoding a MUST or a SHOULD?
> Everywhere?
>
> I am fine with a SHOULD everywhere and maybe making it a MUST for
> certain parts of the HEADER, like space separator. However, I think
> before we require minimal encoding in PARAM-VALUE and MSG, we should
> explore the reasons why UTF-8 allows for different encodings. There
> may be good reason for it. We need to have a good reason to re-define
> the use of the standard for parts of the message which may be received
> by library from third-party applications. My concern is that some
> perfectly legitimate UTF-8 code in the field may not do minimum
> encoding. Then, we are making syslog protocol adoption more difficult
> by requiring it.
>
> > For more info you can also reference the most recent ISO
> > 10646-1 and 10646-2 (with extensions). With minimal
> > encodings you eliminate some potential buffer overflows and
> > you simplify the use of regular expression matching. It is
> > easy enough for an incoming message filter to detect and
> > recode UTF-8 into minimal encoding, but you need to say this
> > in the specification to inform people that they need the
> > filter on the incoming side and that the emitters of messages
> > should use the minimal form.
> >
> > 2) There are multiple blank space characters defined in
> > Unicode. These are typographically different. There is only
> > one that corresponds to the ASCII blank character and its
> > minimal encoding using UTF-8 is intentionally identical to
> > the encoding of the ASCII blank character. The confusion may
> > be resolved by identifying this Unicode code point by number
> > rather than just saying "blank".
>
> I could not find the word "blank" anywhere in the latest draft. The
> encoding defines the space explicitly as:
>
> SP = %d32
>
> Do you think we need to specify more?
>
> Does UTF-8 allow more than one encoding for basic ASCII character
> subset or only for characters with larger Unicode code points?
>
> > 3) Not mentioned originally, but also a potential problem,
> > are the other homotype and semi-homotype characters. For
> > example, there are multiple backslash characters. In fact
> > there are three of them in common use, one the ASCII
> > character (whose minimal UTF-8 encoding matches the ASCII
> > character) and two that are used in Japanese. These are
> > pseudo-homotype characters in that a close examination will
> > reveal that in a high precision font they are all different
> > in size and slope. But in many situations they look the same.
> >
> > More importantly from the perspective of regular use, the
> > ASCII backslash character was replaced in the Japanese 7-bit
> > Latin characterset by the Yen symbol.
> > So the Japanese will have significant problems regarding use
> > of backslash. Even if you specify the use of the proper
> > Unicode character set, encoded using minimal size UTF-8, all
> > the backslashes will be presented to Japanese users as Yen
> > symbols on most systems. These systems make the assumption
> > that what they are seeing is the older modified 7-bit ASCII
> > that is standard in Japan. This is almost always the correct
> > assumption.
> >
> > There is no simple solution to the backslash problem. The
> > backslash should not be given any special meaning in any
> > protocol. The various default workarounds for conflicts
> > between the older and newer systems introduce a lot of
> > confusion around this character. If it has special meaning
> > to computers there will always be confusion and problems. If
> > you leave it an ordinary non-special character the humans who
> > read the message usually have enough context to decide
> > whether the character is intended to mean yen or backslash
> > and will know from their application context how to interpret
> > the text.
> >
> > If you have messages that must be composed by people and must
> > contain backslashes you have an even worse problem. They
> > have a backslash character on the keyboard, but it will
> > generate the Japanese backslashes, not the ASCII backslash.
> > This effectively guarantees problems with entering backslash
> > in Japan because people will forget that they need to do
> > something special and will just use the keyboard.
>
> Will this issue be addressed if instead of referring to "\" when we
> talk about escaping it in PARAM-VALUE and using it as escape sequence,
> we were to specifically refer to ASCII character %d92 instead?
>
> Thanks,
> Anton.

_______________________________________________
Syslog-sec mailing list
Syslog-sec@www.employees.org
http://www.employees.org/mailman/listinfo/syslog-sec
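
Two questions raised in the thread (whether the ASCII subset can have more than one UTF-8 encoding, and what naming code value %d92 buys) can be explored with a small Python sketch. It is an illustration only: `naive_decode_2byte` and `is_minimal_2byte` are hypothetical helpers written for this example, standing in for a permissive decoder and for the incoming filter discussed above. The byte pair 0xC1 0x9C is used as a non-minimal two-byte form of the backslash.

```python
# Sketch: a permissive 2-byte decoder versus a minimal-form check.

def naive_decode_2byte(b0: int, b1: int) -> int:
    """Hypothetical permissive decoder: combines the payload bits of a
    2-byte sequence (110xxxxx 10xxxxxx) without checking minimality."""
    return ((b0 & 0x1F) << 6) | (b1 & 0x3F)


def is_minimal_2byte(b0: int, b1: int) -> bool:
    """A 2-byte form is minimal only if the code point does not fit in
    7 bits, i.e. it is at least 0x80."""
    return naive_decode_2byte(b0, b1) >= 0x80


overlong_backslash = bytes([0xC1, 0x9C])  # non-minimal form of %d92

# The permissive decoder turns it into an ordinary backslash ...
assert naive_decode_2byte(*overlong_backslash) == 0x5C
# ... which the minimal-form check flags.
assert not is_minimal_2byte(*overlong_backslash)
# The genuine 2-byte yen sign (U+00A5) passes the check.
assert is_minimal_2byte(0xC2, 0xA5)

# A filter that scans for the single byte 92 never sees the overlong form.
assert b"\\" not in overlong_backslash

# Python's built-in utf-8 codec refuses the non-minimal sequence outright.
try:
    overlong_backslash.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True
assert rejected
```

Under these assumptions, a receiver that matches on byte value 92 and also rejects non-minimal sequences sees exactly one spelling of the escape character. That does not help with what a Japanese display or keyboard shows for that byte.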