Chris, I can agree to what you propose. So it's fine with me.
Question: does it make any sense to answer some of Patrik's questions (in order to obtain some more advice)? I guess he is pretty busy, so we might save this for later. I'd appreciate your advice.

Rainer

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Chris Lonvick
> Sent: Wednesday, December 07, 2005 8:11 PM
> To: [EMAIL PROTECTED]
> Subject: Re: [Syslog] MSG encoding and content (#3, #4, #5) (fwd)
>
> Hi Folks,
>
> I asked Patrik Faltstrom to review this proposal. He has some comments
> below. Let's not get hung up on his details - he has looked this over
> without any knowledge of our prior discussions. He does have some good
> pointers.
>
> We may want to consider a "belt and suspenders" approach.
>
> - senders MAY indicate their charset in the SD-ID. If the SD-ID does not
>   contain any indication of a charset, then the receiver will just have to
>   guess (it may be US-ASCII or it may be something entirely different).
>   Having the UTF-8 BOM there would be a good indication that it is UTF-8.
>
> - senders are RECOMMENDED to include a charset indicator in the SD-ID.
>   The ONLY one defined in the syslog-protocol will be [charset="UTF-8"].
>   When that is specified, then the BOM MUST be present.
>
> To address Bazsi's concerns of too many charset definitions, Rainer could
> indicate that additional charset values can only be accepted by the IANA
> through Standards Action (RFC 2434).
>
> As Patrik indicates, it would be good to see this separated into
> - what can the sender send
> - what will the receiver expect to receive.
>
> I would like to see other comments on this proposal. I need to review the
> threads but I believe that we have rough consensus on all of the other
> issues so that Rainer can re-work syslog-protocol.
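The receiver side of the "belt and suspenders" rules above can be sketched as follows. This is only an illustration of the proposal as worded in this message, not text from syslog-protocol: the function name `classify_msg`, its return values, and the way the charset indicator is passed in (already extracted from the SD-ID) are all assumptions.

```python
import codecs
from typing import Optional

UTF8_BOM = codecs.BOM_UTF8  # the three octets EF BB BF


def classify_msg(msg: bytes, declared_charset: Optional[str] = None) -> str:
    """Return 'utf-8', 'invalid', or 'unknown' for a received MSG field.

    declared_charset is the value of the (hypothetical) charset indicator
    from the SD-ID, or None if the sender did not include one.
    """
    has_bom = msg.startswith(UTF8_BOM)
    if declared_charset is not None:
        if declared_charset.upper() == "UTF-8":
            # Proposal: when charset="UTF-8" is specified, the BOM MUST be present.
            return "utf-8" if has_bom else "invalid"
        # Only UTF-8 is defined in the proposal; other values are out of scope here.
        return "unknown"
    # No indicator: the BOM is a good hint, otherwise the receiver has to guess.
    return "utf-8" if has_bom else "unknown"
```

What a receiver does with an "invalid" or "unknown" message (reject, pass through, or guess) is left open in the thread, so the sketch only reports the classification.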
>
> Thanks,
> Chris
>
> PAF's comments below >>>
>
>
> ---------- Forwarded message ----------
> Date: Wed, 7 Dec 2005 17:23:24 +0100
> From: "[ISO-8859-1] Patrik Fältström"
> To: Chris Lonvick <[EMAIL PROTECTED]>
> Subject: Re: [Syslog] MSG encoding and content (#3, #4, #5) (fwd)
>
> > Let's first quickly review what has been discussed on list:
> >
> > - current implementations sometimes use LF as a record delimiter
>
> Ok
>
> > - some implementations use LF inside the MSG part
>
> Ok
>
> > - some implementations include binary data in syslog messages
> >   and would like to continue to do so (but these seem to be few)
>
> Ok
>
> > - there are at least some use cases where a syslogd can not
> >   definitely detect the character encoding of a message
> >   (some of that might be related to the POSIX API, but there
> >   may be a work-around [I had no time yet to evaluate this
> >   in-depth]). It gets problematic if a message from a legacy
> >   sender is received (no encoding information) and transformed
> >   into a syslog-protocol message [I assume this is a valid use-case]
>
> Ok
>
> > - previous discussion showed the need for Unicode. With Unicode, the
> >   term "printable character" basically becomes useless, because there
> >   are so many non-printable characters in Unicode and new ones are
> >   potentially added constantly.
>
> Well...define "printable"... I don't really know what that means.
>
> > - previous consensus thus was that any valid UTF-8 string MUST
> >   be supported inside MSG (including NUL and LF)
>
> NUL and LF are part of Unicode, and because of that of UTF-8. The UTF-8
> encoding encodes NUL and LF as one byte only, with the same values as the
> LF and NUL we are used to.
>
> > - current discussion has shown that backwards-compatibility
> >   is not absolutely vital (but still desirable)
>
> Ok. Solves some of the binary problems.
>
> > - it was suggested that an "encoding SD-ID" be defined which
> >   carries the character set definition
>
> Hmmm....why is the charset definition needed? Is that to be able to say
> UTF-8 or BIG5 or...? It seems to be better and more important to say
> whether it is UTF-8 or for example binhex encoded binary data.
>
> Remember that the main difference between text and binary is that text is
> to be converted regarding linebreak algorithms, while binary data is not.
>
> > - as a side-note, Tom Petch has provided a very good digression
> >   on "character encoding" terminology which I have reproduced after
> >   my signature. I guess most people on this list already know the
> >   exact differences, but I still find it useful...
>
> Ok. Can not remember I have seen it, but anyway...
>
> > It is somewhat hard to find a good compromise. A compromise, in my point
> > of view, must allow the following:
>
> When looking at a protocol like this, you have to first of all define
> whether the charset translation/transformation is happening in the client
> or in the server. This is not really clear to me. If the transformation is
> in the client, the client translates to for example UTF-8. It can also be
> the server doing it. (Or of course a client that reads from wherever the
> syslog daemon stores the data, so that the storage can handle multiple
> charsets...but I think this is out of the question?)
>
> > - transforming existing messages into -protocol format should
> >   not intentionally be forbidden - transformation is a very
> >   important "feature" when it comes to deploying new technology
>
> Yup.
>
> > - new receivers should be able to precisely "enough" understand
> >   the message content
>
> Ok. Message content from old senders?
>
> > - I also find it advisable that newer receivers are capable
> >   to process both old-style and new-style messages concurrently.
> >   While this is an implementation issue, it might be a hint for
> >   us that some subtleties in character encoding must be dealt
> >   with in any case.
>
> Ok.
>
> > - we should try NOT to include the myriad of possible encoding
> >   technologies, at least not promote this for needs other than
> >   backwards compatibility
>
> You have to differentiate between:
>
> - The protocol has the ability to handle any encoding technology
> - What encoding technologies to have as a MUST or SHOULD implement
>
> Two different things.
>
> > To solve the encoding issue, an "encoding" SD-ID has been proposed that
> > describes the encoding of the MSG part (I do not use precise wording on
> > which encoding, simply because it is not relevant in this context - read
> > on...). This SD-ID would by its very nature be optional. I follow
> > Darren's reminder that truncation can always make SD-IDs (all or part)
> > disappear. As such, the encoding specification would not be guaranteed
> > to be received by the final destination. This contradicts the
> > intention of that SD-ID: its ultimate purpose was to enable the
> > receiver to use proper decoding for the MSG part.
>
> Ok.
>
> If you talk about truncation, the important thing is that the encoding
> information is coming before the data that is encoded, so the data and not
> the meta-information is truncated, if any.
>
> > Of course, this also raises the question if the SD-ID concept is good
> > enough. For obvious reasons it suffers from the lack of reliability. I
> > think this in general is acceptable. The only cure would be to bring
> > reliability and thus full-duplex communication to syslog. This is way
> > beyond our charter (if you like this, you should probably join NETCONF
> > and help on NETCONF notifications). We have addressed this concern by
> > moving all absolutely vital data to the header.
> > If we allow multiple encodings, the information about the encoding
> > belongs in the header, so we would have another header field. While
> > this is a solution, I think it is overengineered for what we actually
> > need.
>
> Ok.
>
> > Let us keep in mind that our ultimate desire is to have as many messages
> > as possible use Unicode (CCS) and be UTF-8 encoded (CES), with
> > UTF-8 also being the transfer encoding (Tom: I hope I got it right ;)).
>
> In IETF, we say "the charset is UTF-8", and with that we imply Unicode is
> the character set.
>
> So, don't get stuck in the details.
>
> See RFC 3629. Just reference that.
>
> Note byte order.
>
> > Any other encoding should only be supported for backward compatibility
> > either at the protocol level (transforming relays) or to leverage
> > existing APIs (POSIX et al). So we are accepting the fact that other
> > encodings need to be used, but we do not really like it (at least I
> > don't).
> >
> > Assigning a header field for such a somewhat auxiliary feature would put
> > too much weight on it and may even promote its use.
> >
> > So I am now back to the proposal with the Unicode BOM. Let's keep in
> > mind that we either a) know the character set [then we can convert to
> > Unicode]
>
> No, not really. You can not do a proper conversion without losing data.
> The question is whether you include the conversion as part of the
> protocol. Who is doing the conversion? Is a non-UTF-8 charset allowed in
> the protocol? In that case, the receiver of the message is supposed to do
> the translation...right?
>
> > or b) we do not know it [then we can convey no information
> > about it, because else we would actually have case a)]. So a simple
> > indication whether or not MSG contains UTF-8 would be sufficient.
>
> New-style is no problem. Old style is hard.
>
> > I hereby propose that we RECOMMEND to use UTF-8 in all cases where this
> > is possible.
> > If UTF-8 is used, the MSG field MUST be prefixed by the
> > properly-encoded Unicode BOM (a 3-octet overhead).
>
> See http://www.unicode.org/faq/utf_bom.html#29
>
> You can not enforce this I think. I think you should instead have a proper
> header that says whether this is text and whether it is UTF-8.
>
> > Any other encoding
> > MAY be used. In this case the MSG field MUST NOT start with the octet
> > values of the 3-octet UTF-8 encoded Unicode BOM.
>
> I don't think you can say this. You don't know what other charsets might
> use as bytes.
>
> And, how do you know what charset is in use?
>
> How do you know what is binary and not text?
>
> > If necessary, a SP
> > MUST be inserted before this sequence. Such a recommendation is within
> > the expectation of a typical Unicode user/developer (at least I strongly
> > think so).
>
> What is "SP"? Space I guess. If one uses UTF-16, space is not one
> byte...and in EBCDIC I don't know what space is either. I think you talk
> about a specific byte-value here, and not "space" as you don't know what to
> look for when you don't know what charset is in use.
>
> > The specification of other encodings, if there is an actual need for it,
> > should be left for a separate document. That document should specify how
> > to enhance syslog message content in a way inspired by MIME. I expect
> > such a document to make use of SD-IDs to accomplish its goal. That would
> > obviously again be subject to truncation. Here, I find this acceptable,
> > because
>
> Ok.
>
> > a) any -protocol compliant receiver would still be able to process the
> >    message, at least in a basic way (thanks to the BOM)
> > b) specific maximum minimum size restrictions can be placed on compliant
> >    receivers supporting such a specification
> >
> > That "encoding" document should also address the natural
> > language/culture information, which I think we should not move into
> > -protocol.
>
> Ok.
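The sender side of the BOM proposal quoted above can be sketched like this. The function names are hypothetical, and SP is taken to be the single octet 0x20, which is an assumption; that ambiguity is exactly what Patrik's comment about "SP" points at.

```python
import codecs

UTF8_BOM = codecs.BOM_UTF8  # the three octets EF BB BF


def frame_utf8_msg(text: str) -> bytes:
    """Encode text as UTF-8 and prefix the BOM, as proposed for UTF-8 MSG."""
    return UTF8_BOM + text.encode("utf-8")


def frame_other_msg(payload: bytes) -> bytes:
    """Pass through a non-UTF-8 payload unchanged, unless it would be
    mistaken for UTF-8 because it starts with the BOM octets."""
    if payload.startswith(UTF8_BOM):
        # Proposal: insert SP before the sequence (0x20 assumed here).
        return b" " + payload
    return payload
```

The second function shows the cost of the rule: a payload in an unknown charset is altered by one octet whenever it happens to begin with EF BB BF.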
>
> Possible to have alternative formats?
>
> > If we assume the encoding is solved, we still have not decided on NUL,
> > LF and other US-ASCII control characters. If we look at existing syslog
> > implementations, most of them use LF control characters as a kind of
> > framing (End of Record - EOR - markers). Other control characters are
> > simply escaped. Plain binary data is very seldom seen. NUL causes
> > confusion to many existing receivers.
>
> If you use UTF-8, you are fine.
>
> > We can now ask ourselves: what problem does it cause if a sender sends a
> > control character (e.g. BEL) and a relay transforms it to an escaped
> > form (e.g. '^07')? If we follow this route, we see that there is nothing
> > bad with it per se. It becomes a problem only if a digital signature of
> > the message is transmitted (in the way syslog-sign intends to do).
> >
> > IMPORTANT FINDING: There is no problem with message transformation
> > EXCEPT when the messages are digitally signed.
> >
> > IMPORTANT OBSERVATION: we do not yet have digital signatures in syslog.
>
> Yup. Good catch.
>
> > CONCLUSION: we do not need to care!
> >
> > As it looks, we are trying to solve a problem that does not yet even
> > exist. And this not-yet-existing problem is the only issue that is
> > causing us real grief here, especially if we look at backwards
> > compatibility. syslog-sign is still in draft state right now. It is free
> > to place further restrictions on whatever -protocol specifies. Of
> > course, it should not do this in an unexpected and unnecessary way. It
> > can be done quite non-intrusively, at least for the vast majority of
> > syslog data. Please read on, the simple solution will be below, but I
> > need to switch the topic back to syslog-protocol.
> >
> > With all that said, I propose the following for the MSG field in
> > syslog-protocol (in regard to control characters):
>
> Ok.
>
> > MSG MAY contain any character including octets with values less than 32.
> > This is the US-ASCII control character range without DEL, which I
> > generally consider harmless. HOWEVER, it is RECOMMENDED that MSG does
> > NOT include any characters with octet values less than 32.
>
> Ok.
>
> > This applies
> > to both UTF-8 encoded data as well as other data.
>
> No difference.
>
> > If a syslog sender
> > uses octet values less than 32, it MUST expect that a receiver modifies
> > the message, which will lead to invalidation of any existing
> > digital signatures.
>
> Ok.
>
> > If message transformation is not acceptable to the
> > sender, it MUST escape octet values less than 32 before sending the
> > message. All other Unicode control character sequences are not
> > considered extremely problematic, but are best avoided if no message
> > transformation is required. LF and NUL have no special meaning per se.
> > Most importantly, they do NOT indicate the end of the MSG field.
>
> Ok.
>
> What about bidirectional text?
>
> > I think this proposal
> >
> > a) provides an easy way to properly encode all currently-existing syslog
> >    MSG content
> > b) provides a guideline for new implementations
> > c) cautions against control character usage
> > d) levels the ground for syslog-sign
> >
> > While allowing everything, it tells the implementor what is bad.
> > Syslog-sign could then use the hint provided here and restrict
> > to-be-signed messages not to include the US-ASCII control character
> > range without any transfer encoding (like base64).
> >
> > I think this proposal provides a backwards-compatible and yet extensible
> > way to useful MSG content formatting.
> >
> > Please let me know any objections you might have and, if so, please
> > precisely describe the problem you are seeing. Examples, external
> > references, and/or lab test results would be appreciated in those cases.
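As an illustration of the control-character rule quoted above, a sender that does not want its message transformed in transit could escape octets below 32 before sending. The thread only gives '^07' for BEL as an example of an escaped form; reading that as a caret followed by two hexadecimal digits is an assumption, as is the function name.

```python
def escape_control_octets(msg: bytes) -> bytes:
    """Replace every octet below 32 with '^' plus two hex digits (assumed format)."""
    out = bytearray()
    for octet in msg:
        if octet < 32:
            out += b"^%02X" % octet  # e.g. BEL (0x07) becomes b"^07"
        else:
            out.append(octet)
    return bytes(out)
```

Because every octet of a multi-byte UTF-8 sequence is 0x80 or higher, this leaves non-ASCII text untouched, which matches the remark that the rule applies equally to UTF-8 and other data. The sketch is not reversible as written, since a literal '^' in the message is not itself escaped.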
> >
> > Many thanks,
> > Rainer
> >
> > Tom Petch's Digression on "character encoding" terminology:
> > ####
> > Character Set is a set of characters (letters, numbers, symbols, glyphs
> > ...)
> > Coded Character Set [CCS] gives each a (numeric) code, as in ISO 10646.
> > Character Encoding (Scheme/Syntax) [CES] specifies how the codes become
> > octets, as in UTF-8.
> > Transfer Encoding/Syntax specifies how the octets are put on the wire,
> > as in Base64.
> >
> > MIME conflates CCS and CES to charset but keeps (Content) Transfer
> > Encoding distinct; they can be different in different parts of an e-mail.
> > ####
>
> paf
>
_______________________________________________
Syslog mailing list
Syslog@lists.ietf.org
https://www1.ietf.org/mailman/listinfo/syslog