On Sun, Nov 14, 2010 at 9:00 PM, Jon Steinhart wrote:
>> IMPORTANT: 'encoded-word's are designed to be recognized as 'atom's
>> by an RFC 822 parser.  As a consequence, unencoded white space
>> characters (such as SPACE and HTAB) are FORBIDDEN within an
>> 'encoded-word'.  For example, the character sequence
>>
>>    =?iso-8859-1?q?this is some text?=
>>
>> would be parsed as four 'atom's, rather than as a single 'atom' (by
>> an RFC 822 parser) or 'encoded-word' (by a parser which understands
>> 'encoded-words').  The correct way to encode the string "this is some
>> text" is to encode the SPACE characters as well, e.g.
>>
>>    =?iso-8859-1?q?this=20is=20some=20text?=
>
> Well sure, that's in the RFC but it doesn't really make a lot of sense
> to me.

It does, from a historical perspective of email.  RFC 822 was the de
facto standard, so the MIME specification tried not to break 822 and
to play nicely with systems that were not MIME-aware.  The MIME spec
writers also saw MIME as something that could be added on to existing
systems without requiring re-implementation of mail parsing code.
Therefore, as the note states, spaces cannot appear inside encoded
words: a MIME implementation sitting on top of an existing 822 parser
would never see the full encoded string, because the 822 tokenizer
would already have broken it up into separate tokens at each space.

> Would be way more sensible in my opinion to decode everything and
> then parse it as it would eliminate a zillion special cases in
> RFC-land.

In general, non-ASCII encoded data only occurs in a few header fields,
mainly those that are entered/edited by the user, like subject and
recipient fields.

> And, the fact that you can't have an encoded word for H next to an
> encoded word for I to make HI just leads to the RFC2231 ugliness.
> In any case, they chose to do it the overly complex way.  But my
> question is really what to do when somebody sends me this:
>
>    =?iso-8859-1?q?this is some text?=

A MIME-compliant system would leave it untouched, since it is not a
valid encoded word.  If nmh's core parsing engine were an RFC 822 (or
2822) tokenizer, and the MIME parsing worked against 822 tokens (which
was the expected implementation of MIME), then the MIME layer would
never see the full, bad, encoded sequence (since it was broken up).

> Seems more sensible to treat the whole thing as an encoded word and
> to decode it.  Are you suggesting that I should just treat it as text
> and not decode it?

The answer somewhat depends on how you implement things.  If you
implement things as an 822-parsing pass followed by a MIME-parsing
pass, the above invalid string would not get decoded.  If you are
short-cutting full 822 parsing, then the decision is not as clear.
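
A minimal sketch of that two-pass arrangement, in Python.  The white
space split is only a crude stand-in for a real 822 tokenizer (it
knows nothing about quoted strings, comments, or specials), and the
names ENCODED_WORD, decode_word, and decode_two_pass are made up for
illustration; none of this is taken from nmh.

```python
import base64
import binascii
import re

# Strict encoded-word shape: =?charset?encoding?encoded-text?=
# Only printable ASCII other than '?' is accepted, so white space can
# never appear inside a match.
ENCODED_WORD = re.compile(r"=\?([!->@-~]+)\?([QqBb])\?([!->@-~]*)\?=")


def decode_word(token):
    """Decode a token if it is a well-formed encoded-word, else return it as-is."""
    m = ENCODED_WORD.fullmatch(token)
    if m is None:
        return token
    charset, encoding, payload = m.groups()
    if encoding.lower() == "q":
        # header=True makes '_' decode to a space, as the Q encoding requires.
        raw = binascii.a2b_qp(payload.encode("ascii"), header=True)
    else:
        raw = base64.b64decode(payload)
    # An unknown charset raises LookupError; a real parser would catch that.
    return raw.decode(charset)


def decode_two_pass(value):
    # Pass 1 (assumption: simplified tokenizer): split on white space only.
    tokens = value.split()
    # Pass 2: the MIME layer sees one token at a time, never the whole field.
    # Tokens are re-joined with single spaces; the special handling of
    # adjacent encoded-words is deliberately left out of this sketch.
    return " ".join(decode_word(t) for t in tokens)
```

With this arrangement decode_two_pass("=?iso-8859-1?q?this is some text?=")
returns the input unchanged, because none of its four tokens is a
complete encoded word, while the properly encoded
"=?iso-8859-1?q?this=20is=20some=20text?=" comes back as
"this is some text".
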
Of course, such short-cutting may leave your parser unable to deal
with other things that are legal in 822.  IIRC, full 822 parsing is
kind of ugly.  I think 2822 attempted to simplify some things, but I
have not looked into 2822 in depth.

>> As for space between encoded words, such space should be
>> collapsed.  I.e. two adjacent encoded words should be
>> concatenated together after decoding, with no space between
>> them.
>
> Where in what RFC do you find this?  RFC2047 section 5, (1) says that
> encoded words must be separated from each other by linear white space
> but doesn't say that that white space is later removed.

The RFC says that encoded words are limited in length (75 chars,
including the encoding meta-chars, e.g. =?...?=).  If a given piece of
text is very long, it must be split into multiple encoded words that
are folded (CRLF SP).  Hence, during the 822 parsing pass, each split
word becomes its own token (since there is a CRLF SP separator), so in
order to correctly reconstruct the original, unencoded text, the
tokens must be concatenated together after decoding.

> Hmm.  Where in what RFC is this prohibited?  I'll agree that it
> doesn't make a whole lot of sense to have so many mechanisms that do
> the same thing, but what harm would come from this if it was decoded
> properly.

Section 5 of the RFC enumerates where encoded words are allowed.

> So once again, I'm not asking what is proper when encoding a message.
> I'm asking for guidance on sensible behavior when decoding an
> improperly encoded message.

Ah, the "be liberal in what you accept" mantra.  IMO, a bad policy, as
time and experience have shown us, especially from a computer security
perspective.  It also allowed bad implementations to get away with
things that should have been corrected from day one.

The ultimate answer is how often such bad data occurs in the real
world.
If it is rare, I do not see that it is worth the effort to complicate
the parsing code to deal with it, especially when not dealing with it
will not really break anything (i.e. the mail message can still be
read, nmh will not crash, etc).

> I'm unaware of any cases where the character sequences for encoded
> words would appear in any properly formatted items such as dates or
> addresses.  So it seems to me that no harm would be done if I decoded
> such illegal stuff anyway as the alternative is an error message.

Actually, non-ASCII encoded words can occur in address fields, mainly
in comments, such as the comments used to show the human names
associated with the addresses.  Of course, they are not allowed in the
address itself.

> I'm trying to design a simple piece of code that will reasonably
> process everything.  Of course, it can't be that simple since there
> are two incompatible Q encodings and other such cruft.  But I really
> don't want to have to parse every single type of header because it's
> pretty much all text from the scan point of view.

If you have an 822 tokenizer, then section 5 of RFC 2047 tells you in
which token types non-ASCII decoding can be done.  Section 5 even
shows the modified ABNF grammar for the 822 rules on where an encoded
word can appear.

I know I'm probably not giving you the answer you need, but the answer
ultimately depends on what kind of parser you plan on implementing,
with consideration of how frequently bad data occurs in the wild that
nmh should deal with.

--ewh

P.S. As a point of experience, in my Perl program that parses mail,
including MIME mail, I initially took the easier route of parsing
header fields without full tokenization.  However, over the years, the
simpler parsing did not work for all legitimate cases, so I ended up
incorporating more robust 822 parsing on select header fields to
ensure correct behavior in my program.
There is a risk when trying to take shortcuts with header field
parsing.  You may be able to handle most cases just fine, but the
minority cases may cause problems, and then you end up trying to
hack/patch the parser to deal with them, which can get ugly, fast.
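
On the earlier question of white space between adjacent encoded words:
the rule that it is ignored when the words are displayed is stated in
RFC 2047 section 6.2.  A rough Python sketch of that behavior for an
unstructured field such as Subject follows.  The names ENCODED_WORD,
decode_match, and decode_unstructured are hypothetical, and splitting
on white space is again only a simplification of real 822
tokenization.

```python
import base64
import binascii
import re

# Strict encoded-word shape: printable ASCII only, no '?' or white space inside.
ENCODED_WORD = re.compile(r"=\?([!->@-~]+)\?([QqBb])\?([!->@-~]*)\?=")


def decode_match(m):
    """Decode one matched encoded-word to text."""
    charset, encoding, payload = m.groups()
    if encoding.lower() == "q":
        # header=True makes '_' decode to a space.
        raw = binascii.a2b_qp(payload.encode("ascii"), header=True)
    else:
        raw = base64.b64decode(payload)
    return raw.decode(charset)


def decode_unstructured(value):
    pieces = []
    prev_encoded = False  # was the previous token an encoded-word?
    # Each match is (leading white space, token); trailing white space is dropped.
    for ws, token in re.findall(r"(\s*)(\S+)", value):
        m = ENCODED_WORD.fullmatch(token)
        if not (m and prev_encoded):
            # Keep the separator, minus the line break added by folding.
            pieces.append(ws.replace("\r", "").replace("\n", ""))
        # Otherwise the white space sits between two encoded-words: drop it.
        pieces.append(decode_match(m) if m else token)
        prev_encoded = m is not None
    return "".join(pieces)
```

Here decode_unstructured("=?iso-8859-1?q?H?=\r\n =?iso-8859-1?q?I?=")
yields "HI", while ordinary text next to an encoded word keeps its
space: "Re: =?iso-8859-1?q?caf=E9?= menu" becomes "Re: café menu".
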
