Hi,

Thanks to Brandon for reading, and raking up the past :)

It does sound like the rolling checksum is a new idea, so let me explain
why I think it is useful, and note that you may know it from rsync. It
helps in efficiently detecting the original text within an extended new
text.

If we have an old text (body or header: size doesn't matter) of size N,
and a new one of size N+M, then computing checksums or hashes over all
possible embeddings of the original in the new text would normally take
O(N*M) elementary computations (such as adding a character to a
checksum). With a rolling hash, you compute a checksum over the first N
characters, and roll forward by one position in two elementary
computations: adding the next character and removing the one at the
beginning. The complexity of searching for the text is now O(N+M), so in
the order of the total text size.

A checksum match is used as a hint that it would be useful to attempt a
secure hash algorithm on the portion of the text that matches the
checksum. This explains why the proposed DKIM-Signed-Content header
holds:

 - the location of the text (body, header, or something nested)
 - the size N of the signed/original text
 - the outcome of a rolling checksum (perhaps 32 bits)
 - the outcome of a secure hash (to check after the hint)

> will the rolling hash allow for exact boundaries of changes to be
> determined? Subjects are pretty short, how does a rolling hash handle
> that?

No problems there. The limitation is that texts may be added before and
after the original text, but nothing can be stripped.

>
> The mime body canonicalization is odd, what percentage of mail isn't
> multipart at this point?

Interesting. I thought that it was quite sensible to reduce the
MIME-representation of binary content to its original binary form before
signing.

The idea of multipart, which I may need to describe more accurately, is
that the individual body parts get their own DKIM-Signature, and are
then combined (following the MIME hierarchy) to form the overall
DKIM-Signature for the message.
By including Content-ID and Message-ID logically, it is possible to
always distinguish between a body part and the full email.

> It also requires that mime canonicalization requires impl of this
> other spec, which seems more v2ish, though maybe not in formatting.
>

There is no reason why one couldn't sign separately with relaxed/relaxed
and mime/mime canonicalisation. This would allow for a transition
period. Even with just mime/mime, there is currently no punishment (but
also no benefit) for mail parties that don't recognise the
DKIM-Signature due to the new canonicalisation algorithm.

The reason why this may not be so v2 as you suggest is that the
dependency on support by intermediate mail facilities is replaced by a
dependency on support in end points. Note that I'm not talking about
software support, but actual deployment.

> I'd say the mime header canonicalization doesn't go far enough.

Thank you, I tend to agree. I had considered character sets as mere
interpretations of binary content, but was unsure if they were actually
rewritten in passing.

> For example, most mailing lists that are i18n aware are going to
> decode rfc2047 subjects before adding a subject prefix, and then
> reencode, which may not be in the same charset, especially if the
> prefix and subject are in different languages.

A starting charset may be recognised of course, and the number of
original forms would be very limited and may be iterable? This raises
concerns of potential loss of information due to incomplete / incorrect
rewrites, however.

> Another issue we see is in optional things, such as quoting in address
> and parameter headers. Defining a canonical form for those headers
> might be necessary to accomplish what you want.

This would be specific to the MIME-headers, I suppose. Yes, a canonical
form would then be a useful addition to the MIME header canonicalisation
prescription.
The list of headers in use there is limited, so it would be some work
describing it, but not impossible, as far as I can tell. A predefined
order for parameters, always using quotes, only escaping when it is
required, never mentioning default values, that sort of thing.

> There's also smtputf8 downgrades, which would imply taking a utf8
> nonencoded subject and encoding it. It's also useful if your system
> uses a non rfc822 format internally and only gateways to the internet.

These are indeed pesky corner cases :-S so thanks for pointing them out
right away. I am tempted to think that a limited number of alternative
representations for the Subject are possible, and could be iterated over
to recognise the header properly. Always mapping characters to the
largest set spanning the current use case could help; that would be
possible in the before and after case and lead to the same canonical
format.

[I may however be naive about i18n and especially what overlap character
sets have. Please tell me if that is the case.]

> I think a proper mime canonicalization would be useful on it's own,
> we've talked about it before internally here at Google.

Thanks.

> I believe that we discussed something similar to this early on this
> list, but the challenge is in the details.

You certainly convinced me of that. If you have anything to add to the
former, then by all means let me know. It is extremely useful to know
what I'm heading for when deciding to get into this or not.

Also, if the list thinks this is not a fruitful endeavour, then I
welcome further critiques :)

Thanks,
 -Rick
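
PS. To make the rolling checksum search described above a bit more
concrete, here is a minimal sketch in Python. It is only an illustration
under assumptions of mine: the weak checksum is an rsync-style pair of
16-bit sums packed into 32 bits, the secure hash is SHA-256, and the
names weak_checksum, pack and find_signed_text are hypothetical. None of
these choices is fixed by the proposal.

```python
import hashlib

MOD = 1 << 16  # each half of the 32-bit checksum is kept modulo 2^16


def weak_checksum(data: bytes) -> tuple[int, int]:
    """rsync-style weak checksum: a plain sum and a position-weighted sum."""
    n = len(data)
    a = sum(data) % MOD
    b = sum((n - i) * byte for i, byte in enumerate(data)) % MOD
    return a, b


def pack(a: int, b: int) -> int:
    """Combine both halves into one 32-bit value."""
    return (b << 16) | a


def find_signed_text(new_text: bytes, size: int,
                     checksum: int, digest: bytes) -> list[int]:
    """Return all offsets where a window of `size` bytes matches the
    rolling checksum (the hint) and then the SHA-256 digest (the check)."""
    if size == 0 or size > len(new_text):
        return []
    a, b = weak_checksum(new_text[:size])
    matches = []
    for start in range(len(new_text) - size + 1):
        # Only try the expensive secure hash when the cheap hint matches.
        if pack(a, b) == checksum:
            window = new_text[start:start + size]
            if hashlib.sha256(window).digest() == digest:
                matches.append(start)
        end = start + size
        if end < len(new_text):
            # Roll forward one position: drop the first byte, add the next.
            out_byte, in_byte = new_text[start], new_text[end]
            a = (a - out_byte + in_byte) % MOD
            b = (b - size * out_byte + a) % MOD
    return matches


original = b"Hello list"
extended = b"[dmarc] " + original + b"\n-- footer"
hint = pack(*weak_checksum(original))
secure = hashlib.sha256(original).digest()
print(find_signed_text(extended, len(original), hint, secure))  # prints [8]
```

The loop touches every byte of the extended text a constant number of
times, which is where the O(N+M) comes from; the secure hash only runs at
positions where the 32-bit hint already matches.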
