RE: [Syslog] MSG encoding and content (#3, #4, #5) (fwd)

Rainer Gerhards Thu, 08 Dec 2005 07:55:52 -0800

Chris,

I can agree to what you propose.  So it's fine with me.


Question: does it make sense to answer some of Patrik's questions (in order
to obtain some more advice)? I guess he is pretty busy, so we might save this
for later. I'd appreciate your advice.

Rainer

> -----Original Message-----
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On Behalf Of Chris Lonvick
> Sent: Wednesday, December 07, 2005 8:11 PM
> To: [EMAIL PROTECTED]
> Subject: Re: [Syslog] MSG encoding and content (#3, #4, #5) (fwd)
> 
> Hi Folks,
> 
> I asked Patrik Faltstrom to review this proposal.  He has some comments
> below.  Let's not get hung up on his details - he has looked this over
> without any knowledge of our prior discussions.  He does have some good
> pointers.
> 
> We may want to consider a "belt and suspenders" approach.
> 
> - senders MAY indicate their charset in the SD-ID.  If the SD-ID does not
> contain any indication of a charset, then the receiver will just have to
> guess (it may be US-ASCII or it may be something entirely different).
> Having the UTF-8 BOM there would be a good indication that it is UTF-8.
> 
> - senders are RECOMMENDED to include a charset indicator in the SD-ID.
> The ONLY one defined in the syslog-protocol will be [charset="UTF-8"].
> When that is specified, then the BOM MUST be present.
> 
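As a rough illustration of the two rules above, a receiver-side check might look like the following Python sketch. The helper name classify_msg, its return values, and the idea of passing the declared charset as a plain string are assumptions made for illustration only; none of them come from syslog-protocol.

```python
import codecs
from typing import Optional

UTF8_BOM = codecs.BOM_UTF8  # the three octets EF BB BF


def classify_msg(msg: bytes, declared_charset: Optional[str] = None) -> str:
    """Return 'utf-8', 'invalid' or 'unknown' for a raw MSG field.

    Hypothetical helper: declared_charset stands for whatever charset
    indicator (if any) the receiver found in the structured data.
    """
    has_bom = msg.startswith(UTF8_BOM)
    if declared_charset is not None and declared_charset.upper() == "UTF-8":
        # Proposed rule: if charset="UTF-8" is declared, the BOM MUST be present.
        return "utf-8" if has_bom else "invalid"
    if has_bom:
        # No declaration, but the BOM is a good indication of UTF-8.
        return "utf-8"
    # No indication at all: the receiver just has to guess.
    return "unknown"
```

A message without any indicator ends up as "unknown", which matches the "receiver will just have to guess" case.
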
> To address Bazsi's concerns of too many charset definitions, Rainer could
> indicate that additional charset values can only be accepted by the IANA
> through Standards Action (RFC 2434).
> 
> As Patrik indicates, it would be good to see this separated into
> - what can the sender send
> - what will the receiver expect to receive.
> 
> 
> I would like to see other comments on this proposal.  I need to review the
> threads but I believe that we have rough consensus on all of the other
> issues so that Rainer can re-work syslog-protocol.
> 
> Thanks,
> Chris
> 
> PAF's comments below >>>
> 
> 
> ---------- Forwarded message ----------
> Date: Wed, 7 Dec 2005 17:23:24 +0100
> From: Patrik Fältström
> To: Chris Lonvick <[EMAIL PROTECTED]>
> Subject: Re: [Syslog] MSG encoding and content (#3, #4, #5) (fwd)
> 
> > Let's first quickly review what has been discussed on list:
> > 
> > - current implementations sometimes use LF as a record delimiter
> 
> Ok
> 
> > - some implementations use LF inside the MSG part
> 
> Ok
> 
> > - some implementations include binary data in syslog messages
> >   and would like to continue to do so (but these seem to be few)
> 
> Ok
> 
> > - there are at least some use cases where a syslogd can not
> >   definitely detect the character encoding of a message
> >   (some of that might be related to the POSIX API, but there
> >   may be a work-around [I had no time yet to evaluate this
> >   in-depth]). It gets problematic if a message from a legacy
> >   sender is received (no encoding information) and transformed
> >   into a syslog-protocol message [I assume this is a valid
> >   use-case])
> 
> Ok
> 
> > - previous discussion showed the need for Unicode. With Unicode, the
> >   term "printable character" basically becomes useless, because there
> >   are so many non-printable characters in Unicode and new ones are
> >   potentially added constantly.
> 
> Well...define "printable"... I don't really know what that means.
> 
> > - previous consensus thus was that any valid UTF-8 string MUST
> >   be supported inside MSG (including NUL and LF)
> 
> NUL and LF are part of Unicode, and because of that also of UTF-8.
> UTF-8 encodes NUL and LF as one byte each, with the same values for
> NUL and LF that we are used to.
> 
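This point can be checked with a few lines of Python. It is only a small demonstration using the standard codecs; the variable names are arbitrary.

```python
# NUL (U+0000) and LF (U+000A) each encode to a single octet in UTF-8,
# with the same value as in US-ASCII.
nul = "\x00".encode("utf-8")
lf = "\n".encode("utf-8")

# Every octet of a multi-byte UTF-8 sequence is 0x80 or above, so an
# octet 0x00 or 0x0A inside valid UTF-8 is always a real NUL or LF.
multi = "\u00e4\u20ac\U0001F600".encode("utf-8")  # 2-, 3- and 4-octet characters
low_octets_in_multibyte = [octet for octet in multi if octet < 0x80]
```
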
> > - current discussion has shown that backwards-compatibility
> >   is not absolutely vital (but still desirable)
> 
> Ok. Solves some of the binary problems.
> 
> > - it was suggested that an "encoding SD-ID" be defined which
> >   carries the character set definition
> 
> Hmmm....why is the charset definition needed? Is that to be able to say
> UTF-8 or BIG5 or...? It seems to be better and more important to say
> whether it is UTF-8 or, for example, binhex encoded binary data.
> 
> Remember that the main difference between text and binary is that text is
> subject to conversion by linebreak algorithms, while binary data is not.
> 
> > - as a side-note, Tom Petch has provided a very good digression
> >   on "character encoding" terminology which I have reproduced after
> >   my signature. I guess most people on this list already know the
> >   exact differences, but I still find it useful...
> 
> Ok. I can not remember having seen it, but anyway...
> 
> > It is somewhat hard to find a good compromise. A compromise, in my point
> > of view, must allow the following:
> 
> When looking at a protocol like this, you first of all have to define
> whether the charset translation/transformation happens in the client or
> in the server. This is not really clear to me. If the transformation is
> in the client, the client translates to, for example, UTF-8. It can also
> be the server doing it. (Or of course a client that reads from wherever
> the syslog daemon stores the data, so that the storage can handle
> multiple charsets...but I think this is out of the question?)
> 
> > - transforming existing messages into -protocol format should
> >   not intentionally be forbidden - transformation is a very
> >   important "feature" when it comes to deploying new technology
> 
> Yup.
> 
> > - new receivers should be able to precisely "enough" understand
> >   the message content
> 
> Ok. Message content from old senders?
> 
> > - I also find it advisable that newer receivers are capable
> >   to process both old-style and new-style messages concurrently.
> >   While this is an implementation issue, it might be a hint for
> >   us that some subtleties in character encoding must be dealt
> >   with in any case.
> 
> Ok.
> 
> > - we should try NOT to include the myriad of possible encoding
> >   technologies, at least not promote this for needs other than
> >   backwards compatibility
> 
> You have to distinguish between:
> 
> - The protocol having the ability to handle any encoding technology
> - What encoding technologies to have as a MUST or SHOULD implement
> 
> Two different things.
> 
> > To solve the encoding issue, an "encoding" SD-ID has been proposed that
> > describes the encoding of the MSG part (I do not use precise wording on
> > which encoding, simply because it is not relevant in this context - read
> > on...). This SD-ID would by its very nature be optional. I follow
> > Darren's reminder that truncation can always make SD-IDs (all or part)
> > disappear. As such, the encoding specification would not be guaranteed
> > to be received by the final destination. This contradicts the
> > intention of that SD-ID: its ultimate purpose was to enable the
> > receiver to use proper decoding for the MSG part.
> 
> Ok.
> 
> If you talk about truncation, the important thing is that the encoding
> information comes before the data that is encoded, so that the data and
> not the meta-information is truncated, if anything.
> 
> > Of course, this also raises the question if the SD-ID concept is good
> > enough. For obvious reasons it suffers from the lack of reliability. I
> > think this in general is acceptable. The only cure would be to bring
> > reliability and thus full-duplex communication to syslog. This is way
> > beyond our charter (if you like this, you should probably join NETCONF
> > and help on NETCONF notifications). We have addressed this concern by
> > moving all absolutely vital data to the header. If we allow multiple
> > encodings, the information about the encoding belongs into the header,
> > so we would have another header field. While this is a solution, I think
> > it is overengineered for what we actually need.
> 
> Ok.
> 
> > Let us keep in mind that our ultimate desire is to have as many messages
> > as possible use Unicode (CCS) and be UTF-8 encoded (CES), with
> > UTF-8 also being the transfer encoding (Tom: I hope I got it right ;)).
> 
> In IETF, we say "the charset is UTF-8", and with that we imply Unicode is
> the character set.
> 
> So, don't get stuck in the details.
> 
> See RFC 3629. Just reference that.
> 
> Note byte order.
> 
> > Any other encoding should only be supported for backward compatibility
> > either at the protocol level (transforming relays) or to leverage
> > existing APIs (POSIX et al). So we are accepting the fact that other
> > encodings need to be used, but we do not really like it (at least I
> > don't).
> > 
> > Assigning a header field for such a somewhat auxiliary feature would put
> > too much weight on it and may even promote its use.
> > 
> > So I am now back to the proposal with the Unicode BOM. Let's keep in
> > mind that we either a) know the character set [then we can convert to
> > Unicode]
> 
> No, not really. You can not do a proper conversion without losing data.
> The question is whether you include the conversion as part of the
> protocol. Who is doing the conversion? Is a non-UTF-8 charset allowed in
> the protocol? In that case, the receiver of the message is supposed to do
> the translation...right?
> 
> > or b) we do not know it [then we can convey no information
> > about it, because else we would actually have case a)]. So a simple
> > indication whether or not MSG contains UTF-8 would be sufficient.
> 
> New-style is no problem. Old style is hard.
> 
> > I hereby propose that we RECOMMEND to use UTF-8 in all cases where this
> > is possible. If UTF-8 is used, the MSG field MUST be prefixed by the
> > properly-encoded Unicode BOM (a 3-octet overhead).
> 
> See http://www.unicode.org/faq/utf_bom.html#29
> 
> You can not enforce this I think. I think you should instead have a proper
> header that says whether this is text and whether it is UTF-8.
> 
> > Any other encoding
> > MAY be used. In this case the MSG field MUST NOT start with the octet
> > values of the 3-octet UTF-8 encoded Unicode BOM.
> 
> I don't think you can say this. You don't know what other charsets might
> use as bytes.
> 
> And, how do you know what charset is in use?
> 
> How do you know what is binary and not text?
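To make the concern about other charsets concrete: the three BOM octets are also a perfectly valid start of an ISO-8859-1 message. The short Python check below only demonstrates that one case; it says nothing about how common such a message would be in practice.

```python
import codecs

bom = codecs.BOM_UTF8  # EF BB BF

# In ISO-8859-1 every octet maps to a character, so the same three octets
# decode to three ordinary printable characters.
as_latin1 = bom.decode("iso-8859-1")

# Interpreted as UTF-8, the same octets are the single code point U+FEFF.
as_utf8 = bom.decode("utf-8")
```

So a receiver that sees EF BB BF cannot tell from the octets alone whether the sender meant a BOM or three Latin-1 characters.
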
> 
> > If necessary, a SP
> > MUST be inserted before this sequence. Such a recommendation is within
> > the expectation of a typical Unicode user/developer (at least I strongly
> > think so).
> 
> What is "SP"? Space I guess. If one uses UTF-16, space is not one
> byte...and in EBCDIC I don't know what space is either. I think you talk
> about a specific byte-value here, and not "space", as you don't know what
> to look for when you don't know what charset is in use.
> 
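For reference, the octets a space character turns into under a few encodings can be listed with Python's standard codecs. The choice of cp037 as the EBCDIC variant is an assumption made for this example; other EBCDIC code pages exist.

```python
# The same character, U+0020 SPACE, as octets in different encodings.
space_ascii = " ".encode("ascii")        # one octet
space_utf16be = " ".encode("utf-16-be")  # two octets
space_ebcdic = " ".encode("cp037")       # one octet, but a different value
```

This supports the point that a rule about "SP" only makes sense as a rule about a specific octet value.
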
> > The specification of other encodings, if there is an actual need for it,
> > should be left for a separate document. That document should specify how
> > to enhance syslog message content in a way inspired by MIME. I expect
> > such a document to make use of SD-IDs to accomplish its goal. That would
> > obviously again be subject to truncation. Here, I find this acceptable,
> > because
> 
> Ok.
> 
> > a) any -protocol compliant receiver would still be able to process the
> > message, at least in a basic way (thanks to the BOM)
> > b) specific maximum minimum size restrictions can be placed on compliant
> > receivers supporting such a specification
> > 
> > That "encoding" document should also address the natural
> > language/culture information, which I think we should not move into
> > -protocol.
> 
> Ok.
> 
> Possible to have alternative formats?
> 
> > If we assume the encoding is solved, we still have not decided on NUL,
> > LF and other US-ASCII control characters. If we look at existing syslog
> > implementations, most of them use LF control characters as a kind of
> > framing (End of Record - EOR - markers). Other control characters are
> > simply escaped. Plain binary data is very seldom seen. NUL causes
> > confusion to many existing receivers.
> 
> If you use UTF-8, you are fine.
> 
> > We can now ask ourselves: what problem does it cause if a sender sends a
> > control character (e.g. BEL) and a relay transforms it to an escaped
> > form (e.g. '^07')? If we follow this route, we see that there is nothing
> > bad with it per se. It becomes a problem only if a digital signature of
> > the message is transmitted (in the way syslog-sign intends to do).
> > 
> > IMPORTANT FINDING: There is no problem with message transformation
> > EXCEPT when the messages are digitally signed.
> > 
> > IMPORTANT OBSERVATION: we do not yet have digital signatures in syslog.
> 
> Yup. Good catch.
> 
> > CONCLUSION: we do not need to care!
> > 
> > As it looks, we are trying to solve a problem that does not yet even
> > exist. And this not-yet-existing problem is the only issue that is
> > causing us real grief here, especially if we look at backwards
> > compatibility. syslog-sign is still in draft state right now. It is free
> > to place further restrictions on whatever -protocol specifies. Of
> > course, it should not do this in an unexpected and unnecessary way. It
> > can be done quite non-intrusively, at least for the vast majority of
> > syslog data. Please read on, the simple solution will be below, but I
> > need to switch the topic back to syslog-protocol.
> > 
> > With all that said, I propose the following for the MSG field in
> > syslog-protocol (in regard to control characters):
> 
> Ok.
> 
> > MSG MAY contain any character including octets with values less than 32.
> > This is the US-ASCII control character range without DEL, which I
> > generally consider harmless. HOWEVER, it is RECOMMENDED that MSG does
> > NOT include any characters with octet values less than 32.
> 
> Ok.
> 
> > This applies
> > to both UTF-8 encoded data as well as other data.
> 
> No difference.
> 
> > If a syslog sender
> > uses octet values less than 32, it MUST expect that a receiver modifies
> > the message, which will lead to invalidation of any existing
> > digital signatures.
> 
> Ok.
> 
> > If message transformation is not acceptable to the
> > sender, it MUST escape octet values less than 32 before sending the
> > message. All other Unicode control character sequences are not
> > considered extremely problematic, but are best avoided if no message
> > transformation is required. LF and NUL have no special meaning per se.
> > Most importantly, they do NOT indicate the end of the MSG field.
> 
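A minimal sketch of the escaping step, and of why it matters for signatures, could look like this in Python. The proposal only gives '^07' as an example of an escaped BEL; writing every octet below 32 as a caret followed by two uppercase hex digits is an assumption made here, and escape_controls is a hypothetical name. SHA-256 merely stands in for whatever digest a signing scheme would use.

```python
import hashlib


def escape_controls(msg: bytes) -> bytes:
    """Replace each octet below 32 with '^' plus two hex digits (assumed format)."""
    out = bytearray()
    for octet in msg:
        if octet < 32:
            out += b"^%02X" % octet  # e.g. BEL (7) becomes ^07
        else:
            out.append(octet)
    return bytes(out)


original = b"disk full\x07 on /var\n"
escaped = escape_controls(original)

# A digest over the transformed message no longer matches the original one,
# which is the only case where the transformation causes harm.
digest_changed = hashlib.sha256(original).digest() != hashlib.sha256(escaped).digest()
```

Note that this sketch does not escape a literal '^' in the input, so it is not reversible as written.
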
> Ok.
> 
> What about bidirectional text?
> 
> > I think this proposal
> > 
> > a) provides an easy way to properly encode all currently-existing syslog
> > MSG content
> > b) provides a guideline for new implementations
> > c) cautions against control character usage
> > d) levels the ground for syslog-sign
> > 
> > While allowing everything, it tells the implementor what is bad.
> > Syslog-sign could then use the hint provided here and restrict
> > to-be-signed messages not to include the US-ASCII control character
> > range without any transfer encoding (like base64).
> > 
> > I think this proposal provides a backwards-compatible and yet extensible
> > way to useful MSG content formatting.
> > 
> > Please let me know any objections you might have and, if so, please
> > precisely describe the problem you are seeing. Examples, external
> > references, and/or lab test results would be appreciated in those cases.
> > 
> > Many thanks,
> > Rainer
> > 
> > Tom Petch's Digression on "character encoding" terminology:
> > ####
> > Character Set is a set of characters (letters, numbers, symbols, glyphs
> > ...)
> > Coded Character Set [CCS] gives each a (numeric) code, as in ISO 10646.
> > Character Encoding (Scheme/Syntax) [CES] specifies how the codes become
> > octets, as in UTF-8.
> > Transfer Encoding/Syntax specifies how the octets are put on the wire,
> > as in Base64.
> > 
> > MIME conflates CCS and CES to charset but keeps (Content) Transfer
> > Encoding distinct; they can be different in different parts of an e-mail.
> > ####
> 
>      paf
> 

_______________________________________________
Syslog mailing list
Syslog@lists.ietf.org
https://www1.ietf.org/mailman/listinfo/syslog
