RE: [Syslog-sec] Syslog protocol - UTF-8 encoding

Robert Horn Fri, 03 Jun 2005 10:40:05 -0700

Sorry, ASCII is an exact subset of Unicode encoded in UTF-8.  There is no
problem for syslog.  The problem is unique to Japan.  The Japanese
predominantly use the family of charactersets called ISO 2022-JP by the
IETF.  This family includes two 7-bit charactersets in addition to the
multi-byte characters for Kanji.  One of these 7-bit charactersets is the
same as ASCII with the sole exception of the backslash character.


The result is that this Japanese variation is frequently called "ASCII"
even though it differs in one character.  It is almost always used for
displaying and editing files that are supposed to be kept in ASCII.  This
only introduces problems when the backslash character is used.  Many
commonly deployed systems do not have a strictly compliant ASCII mode
because this is not useful in Japan.  You just accept the occasional
confusion between the yen sign and the backslash.  This worked quite well
until CP/M and MS-DOS started using the ASCII backslash character.  Before
then it was quite rare in ordinary text files where the proper Japanese
backslash is encoded in the Japanese 7-bit set rather than the ASCII 7-bit
set.  (The Japanese 7-bit set is encoded into the range 128-255 when using
ISO 2022-JP.)
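
The byte-level point can be sketched in a few lines of Python.  This is only
an illustration under stated assumptions: the table JIS_ROMAN_OVERRIDES and
the function decode_jis_roman are hypothetical names written for this
example, not part of any library, and the table models only the single 0x5C
difference described above.

```python
# Illustrative only: models the single 0x5C difference described above.
JIS_ROMAN_OVERRIDES = {0x5C: "\u00a5"}  # 0x5C is displayed as the yen sign


def decode_jis_roman(data: bytes) -> str:
    """Decode 7-bit bytes the way a Japanese "ASCII" display would show them."""
    chars = []
    for byte in data:
        if byte > 0x7F:
            raise ValueError("only 7-bit input is modelled here")
        chars.append(JIS_ROMAN_OVERRIDES.get(byte, chr(byte)))
    return "".join(chars)


wire = b"C:\\LOG"                    # bytes 43 3A 5C 4C 4F 47 on the wire
as_ascii = wire.decode("ascii")      # backslash
as_utf8 = wire.decode("utf-8")       # identical to the ASCII reading
as_japanese = decode_jis_roman(wire) # same bytes, shown with a yen sign
```

The bytes on the wire never change; only the presentation differs, which is
why the confusion shows up in user interfaces rather than in the protocol.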

The problem has gotten worse as people try to adapt to a world where ISO
2022-JP coexists with UTF-8 Unicode.  Some user interfaces try to help out
by doing character substitutions.  This makes things worse, since the
logical substitution in Japan is to use the most common Japanese form of
backslash, which is not the ASCII backslash.   Someday in the distant
future the transition to a uniform character coding will be complete, but
for the near future these transitional behaviors will exist.  I expect that
some Japanese systems will use syslog without using multi-byte characters.
They will stick to their "ASCII" subset from the ISO 2022-JP family, will
not convert their Kanji to Unicode, and will not attempt to send Kanji as
part of a syslog message.  That is a very reasonable behavior.  They are
the people who need the reminder about the backslash.
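
One way such a reminder could be backed by tooling is a check that flags
characters a user may have typed in place of the ASCII backslash.  The
sketch below is an assumption-laden example: the LOOKALIKES tuple is my own
short list, and find_backslash_lookalikes is a hypothetical helper, not
something taken from the syslog draft.

```python
import unicodedata

ASCII_BACKSLASH = "\u005c"
# Assumed list of characters that may be typed or displayed in place of
# the ASCII backslash; a real deployment would choose its own list.
LOOKALIKES = ("\u00a5", "\uff3c", "\uffe5")


def find_backslash_lookalikes(message: str):
    """Return (index, Unicode name) for each lookalike found in message."""
    return [
        (index, unicodedata.name(char))
        for index, char in enumerate(message)
        if char in LOOKALIKES
    ]


clean = find_backslash_lookalikes("path=C:" + ASCII_BACKSLASH + "log")
suspect = find_backslash_lookalikes("path=C:\u00a5log")
```

A sender could log or warn on a non-empty result instead of silently
substituting, since substitution is exactly what causes the trouble
described above.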

There is another whole nest of complexity when you go into the Chinese
languages.  The problems there are at a more abstract level.  The IETF has
decided to internationalize using Unicode encoded as UTF-8.  The mainland
Chinese government has decided to internationalize using GB18030 and
requires this by law for some uses.  The 7-bit subset of GB18030 is the
same as the 7-bit ASCII and the same as the first 128 characters of Unicode
encoded as UTF-8.  So there will be no conflict there until someone wants
to encode Hanzi, Yi, or Mongolian characters into a syslog message.   The
law mandates support of four languages, not just Mandarin, so you need all
those characters.

After that you get into a purely legal argument about encodings.  As of
Unicode version 2.4 (and all subsequent versions) there is a characterset
coordination between GB18030 and UTF-8 encoded Unicode so that a single
lossless conversion algorithm can be used between the two characterset
encodings.  The various language committees have also reached agreements on
who updates the Chinese language symbols.  So you just need to argue that
the bit encoding over the wire for syslog does not fall into one of the
categories where GB18030 is required by law.  I personally think it falls
into one of the exceptions.  I expect that most Chinese will avoid using
non-ASCII characters in the message so that they can process it as though
it is GB18030.
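
The two claims above (identical 7-bit subset, lossless conversion for the
rest) can be checked with Python's standard "gb18030" and "utf-8" codecs.
This is a sketch, not a conformance test; the helper names and the sample
strings are my own.

```python
def same_bytes_in_both(text: str) -> bool:
    """True if text has the same byte encoding in GB18030 and UTF-8."""
    return text.encode("gb18030") == text.encode("utf-8")


def round_trips(text: str) -> bool:
    """Convert UTF-8 -> GB18030 -> UTF-8 and check that nothing is lost."""
    utf8 = text.encode("utf-8")
    gb = utf8.decode("utf-8").encode("gb18030")
    return gb.decode("gb18030").encode("utf-8") == utf8


ascii_msg = "disk full \\ retry"   # 7-bit only, includes a backslash
hanzi_msg = "\u4e2d\u6587"          # two Hanzi characters
yi_msg = "\ua000"                   # a Yi syllable
mongolian_msg = "\u1820"            # a Mongolian letter

ascii_identical = same_bytes_in_both(ascii_msg)
hanzi_identical = same_bytes_in_both(hanzi_msg)
all_round_trip = all(
    round_trips(s) for s in (ascii_msg, hanzi_msg, yi_msg, mongolian_msg)
)
```

A message that stays within the 7-bit range can be processed as either
encoding; only messages carrying Hanzi, Yi, or Mongolian text force a
choice.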

R Horn



From:     "Rainer Gerhards" <[EMAIL PROTECTED]>
To:       Robert Horn/WIL/AGFA/US/[EMAIL PROTECTED], <[EMAIL PROTECTED]>
cc:       "Alexander Clemm (alex)" <[EMAIL PROTECTED]>, <[EMAIL PROTECTED]>,
          "Steve Chang (schang99)" <[EMAIL PROTECTED]>,
          <syslog-sec@employees.org>
Date:     06/03/2005 12:09 PM
Subject:  RE: [Syslog-sec] Syslog protocol - UTF-8 encoding

Robert,

thanks for all the good points. Just to clarify one thing that really
puzzles me: so you are saying that ASCII is actually NOT a subset of UTF-8,
because in Japanese the \ character has been replaced by the Yen sign,
so there are two different interpretations of this UTF-8 code?

Rainer

> -----Original Message-----
> From: Robert Horn [mailto:[EMAIL PROTECTED]
> Sent: Friday, June 03, 2005 6:04 PM
> To: [EMAIL PROTECTED]
> Cc: Alexander Clemm (alex); [EMAIL PROTECTED]; Rainer
> Gerhards; Steve Chang (schang99); syslog-sec@employees.org
> Subject: RE: [Syslog-sec] Syslog protocol - UTF-8 encoding
>
>
> A MUST in the header with SHOULD elsewhere would be sufficient, but I
> think that there is little risk making it a MUST everywhere.  ISO made it
> into a MUST with the extensions to 10646-2.  The problem is an oversight
> in the UTF-8 specification.  It specifies how to take an m-bit character
> and break it down into 8-bit chunks.  It was assumed that people would
> always minimize the number of 8-bit chunks used, and this is the general
> practice.  So if I have a character with a 10-bit code point, it will get
> encoded as a 6-bit and a 4-bit chunk.  Then malicious programmers
> discovered that they could get programs to malfunction by using more
> chunks, e.g. encoding a 10-bit code point as two 4-bit chunks and a 2-bit
> chunk.  Sometimes this caused buffer overflows and sometimes it let them
> evade 8-bit oriented regular expression parsers.  These are legitimate
> UTF-8 encodings because the UTF-8 specification failed to require minimal
> size encodings be used.  I am not aware of any reasonable UTF-8 encoder
> that does not generate minimal size encodings.
>
> I didn't have the text with me while traveling, hence the uncertainty
> over the "space".  Specifying the ASCII code value is sufficient.  We
> probably should note to readers that the code value used for backslash in
> ASCII is used for the yen symbol in Japan, and that they should be
> prepared for user interface confusion.  It is inevitable that there will
> be people who use the Japanese backslash character (a valid UTF-8
> character) instead of the correct ASCII code value because they are
> matching what they see on the screen with what they see on the keyboard.
> We should alert them to the problem.  (Or we could pick another
> character, but most of the good characters have already been used for
> other purposes.)
>
> R Horn
>
>
> From:     "Anton Okmianski (aokmians)" <[EMAIL PROTECTED]>
> To:       Robert Horn/WIL/AGFA/US/[EMAIL PROTECTED], <[EMAIL PROTECTED]>
> cc:       "Alexander Clemm (alex)" <[EMAIL PROTECTED]>, <[EMAIL PROTECTED]>,
>           "Steve Chang (schang99)" <[EMAIL PROTECTED]>,
>           <syslog-sec@employees.org>
> Date:     06/02/2005 03:53 PM
> Subject:  RE: [Syslog-sec] Syslog protocol - UTF-8 encoding
>
>
> Robert:
>
> > Potential confusions:
> >
> >   1) Saying UTF-8 is insufficient.  To really cover all the
> > bases (especially from a security and string parsing
> > perspective) you need to say:
> >
> > "Unicode characters encoded in UTF-8 using the minimal
> > encoding."  UTF-8 permits a variety of encodings for the same
> > character, but only one is the minimal encoding.
>
> Are you suggesting we make minimum encoding a MUST or a SHOULD?
> Everywhere?
>
> I am fine with a SHOULD everywhere and maybe making it a MUST for certain
> parts of the HEADER, like the space separator.  However, I think before
> we require minimal encoding in PARAM-VALUE and MSG, we should explore the
> reasons why UTF-8 allows for different encodings.  There may be a good
> reason for it.  We need to have a good reason to re-define the use of the
> standard for parts of the message which may be received by a library from
> third-party applications.  My concern is that some perfectly legitimate
> UTF-8 code in the field may not do minimum encoding.  Then, we are making
> syslog protocol adoption more difficult by requiring it.
>
> > For more info you can also reference the most recent ISO
> > 10646-1 and 10646-2 (with extensions).  With minimal
> > encodings you eliminate some potential buffer overflows and
> > you simplify the use of regular expression matching.  It is
> > easy enough for an incoming message filter to detect and
> > recode UTF-8 into minimal encoding, but you need to say this
> > in the specification to inform people that they need the
> > filter on the incoming side and that the emitters of messages
> > should use the minimal form.
> >
> >   2) There are multiple blank space characters defined in
> > Unicode.  These are typographically different.  There is only
> > one that corresponds to the ASCII blank character and  its
> > minimal encoding using UTF-8 is intentionally identical to
> > the encoding of the ASCII blank character.  The confusion may
> > be resolved by identifying this Unicode code point by number
> > rather than just saying "blank".
>
> I could not find the word "blank" anywhere in the latest draft. The
> encoding defines the space explicitly as:
>
> SP = %d32
>
> Do you think we need to specify more?
>
> Does UTF-8 allow more than one encoding for the basic ASCII character
> subset or only for characters with larger Unicode code points?
>
> >   3) Not mentioned originally, but also a potential problem,
> > are the other homotype and semi-homotype characters.  For
> > example, there are multiple backslash characters.  In fact
> > there are three of them in common use, one the ASCII
> > character (whose minimal UTF-8 encoding matches the ASCII
> > character) and two that are used in Japanese.  These are
> > pseudo-homotype characters in that a close examination will
> > reveal that in a high precision font they are all different
> > in size and slope.  But in many situations they look the same.
> >
> > More importantly from the perspective of regular use, the
> > ASCII backslash character was replaced in the Japanese 7-bit
> > Latin characterset by the Yen symbol.  So the Japanese will
> > have significant problems regarding use of backslash.  Even
> > if you specify the use of the proper Unicode character set,
> > encoded using minimal size UTF-8,  all the backslashes will
> > be presented to Japanese users as Yen symbols on most
> > systems.  These systems make the assumption that what they
> > are seeing is the older modified 7-bit ASCII that is standard
> > in Japan.  This is almost always the correct assumption.
> >
> > There is no simple solution to the backslash problem.  The
> > backslash should not be given any special meaning in any
> > protocol.  The various default workarounds for conflicts
> > between the older and newer systems introduce a lot of
> > confusion around this character.  If it has special meaning
> > to computers there will always be confusion and problems.  If
> > you leave it an ordinary non-special character the humans who
> > read the message usually have enough context to decide
> > whether the character is intended to mean yen or backslash
> > and will know from their application context how to interpret
> > the text.
> >
> > If you have messages that must be composed by people and must
> > contain backslashes you have an even worse problem.  They
> > have a backslash character on the keyboard, but it will
> > generate the Japanese backslashes, not the ASCII backslash.
> > This effectively guarantees problems with entering backslash
> > in Japan because people will forget that they need to do
> > something special and will just use the keyboard.
>
> Will this issue be addressed if, instead of referring to "\" when we talk
> about escaping it in PARAM-VALUE and using it as an escape sequence, we
> were to specifically refer to ASCII character %d92?
>
> Thanks,
> Anton.
>
>
>
>
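
The non-minimal encoding concern discussed in the quoted messages above can
be illustrated with a short Python sketch.  The helper encode_fixed_length
is hypothetical and exists only to build test inputs; it packs a code point
into a chosen number of UTF-8-style bytes even when a shorter form exists.
The second helper simply asks Python's built-in UTF-8 decoder, which
rejects non-minimal forms, whether it accepts the bytes.

```python
def encode_fixed_length(code_point: int, length: int) -> bytes:
    """Pack code_point into exactly `length` UTF-8-style bytes (2, 3 or 4),
    even when a shorter form exists.  Hypothetical helper for illustration."""
    payload_bits = {2: 11, 3: 16, 4: 21}
    lead_prefix = {2: 0xC0, 3: 0xE0, 4: 0xF0}
    if length not in payload_bits:
        raise ValueError("length must be 2, 3 or 4")
    if not 0 <= code_point < (1 << payload_bits[length]):
        raise ValueError("code point does not fit in the requested length")
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (code_point & 0x3F))  # low 6 bits per trailing byte
        code_point >>= 6
    return bytes([lead_prefix[length] | code_point] + tail[::-1])


def accepted_by_strict_decoder(data: bytes) -> bool:
    """True if Python's built-in UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


minimal = encode_fixed_length(0x233, 2)            # U+0233 needs 10 bits
overlong = encode_fixed_length(0x233, 3)           # same code point, one more byte
overlong_backslash = encode_fixed_length(0x5C, 2)  # ASCII %d92 in two bytes
```

Note that the two-byte form of %d92 does not contain the byte 0x5C at all,
so a byte-oriented filter looking for the escape character would miss it.
That is the parsing risk that a minimal-encoding requirement, plus a
rejecting filter on the receiving side, is meant to remove.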



_______________________________________________
Syslog-sec mailing list
Syslog-sec@www.employees.org
http://www.employees.org/mailman/listinfo/syslog-sec
