Hi,

Thanks to Brandon for reading, and raking up the past :)

It does sound like the rolling checksum is a new idea here, so let me
explain why I think it is useful; you may know the technique from rsync.
It helps to efficiently detect the original text within an extended new
text.

If we have an old text (body or header: size doesn't matter) of size N,
and a new one of size N+M, then computing checksums or hashes over all
possible embeddings of the original in the new text would normally take
O(N*M) elementary computations (such as adding a character to a
checksum).  With a rolling hash, you compute a checksum over the first N
characters, and roll forward by one position in two elementary
computations: adding the next character and removing the one at the
beginning.  The complexity of searching for the text is now O(N+M), so in
the order of the total text size.  A checksum match then serves as a hint
that it is worth computing a secure hash over the portion of the text
that matches the checksum.
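
A minimal sketch of the rolling step, assuming a plain additive checksum
over bytes, truncated to 32 bits.  The function name is made up for
illustration and the choice of checksum is an assumption; rsync itself
uses a somewhat stronger two-part rolling sum.

```python
def checksum_hits(old: bytes, new: bytes) -> list:
    """Offsets in `new` whose N-byte window has the same checksum as `old`."""
    n = len(old)
    if n == 0 or n > len(new):
        return []
    mask = 0xFFFFFFFF                    # keep the checksum within 32 bits
    target = sum(old) & mask             # N additions, done once
    window = sum(new[:n]) & mask         # N additions for the first window
    hits = []
    for start in range(len(new) - n + 1):
        if window == target:
            hits.append(start)           # only a hint, not yet a proof
        if start + n < len(new):
            # Roll forward: add the next byte, remove the first one.
            window = (window + new[start + n] - new[start]) & mask
    return hits
```

An additive sum ignores character order, so `checksum_hits(b"ab", b"xxba")`
reports a hit at offset 2 even though the text there is "ba".  That is
exactly why a hit is only a hint that still needs the secure hash.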

This explains why the proposed DKIM-Signed-Content header holds:
 - the location of the text (body, header, or something nested)
 - the size N of the signed/original text
 - the outcome of a rolling checksum (perhaps 32 bits)
 - the outcome of a secure hash (to check after the hint)
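
The four fields above could be modelled roughly as follows.  This is only
a sketch under assumptions: the names SignedContent, describe and locate
are hypothetical, SHA-256 stands in for whichever secure hash would be
chosen, and the checksum is a simple 32-bit additive sum.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

MASK = 0xFFFFFFFF  # assumed 32-bit checksum width


@dataclass
class SignedContent:
    location: str   # hypothetical label such as "body" or "header"
    size: int       # N, the size of the signed/original text
    checksum: int   # rolling checksum over the original text
    digest: str     # secure hash (SHA-256 assumed here), hex encoded


def describe(location: str, original: bytes) -> SignedContent:
    """Build the field values on the signing side."""
    return SignedContent(location, len(original), sum(original) & MASK,
                         hashlib.sha256(original).hexdigest())


def locate(field: SignedContent, received: bytes) -> Optional[int]:
    """Return the offset of the original text in `received`, or None."""
    n = field.size
    if n == 0 or n > len(received):
        return None
    window = sum(received[:n]) & MASK
    for start in range(len(received) - n + 1):
        if window == field.checksum:
            # The checksum hint matched; confirm with the secure hash.
            candidate = received[start:start + n]
            if hashlib.sha256(candidate).hexdigest() == field.digest:
                return start
        if start + n < len(received):
            window = (window + received[start + n] - received[start]) & MASK
    return None
```

The secure hash is only computed at offsets where the cheap checksum
matches, which keeps the common case close to a single pass over the
received text.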

> will the rolling hash allow for exact boundaries of changes to be
> determined?  Subjects are pretty short, how does a rolling hash handle
> that?

No problems there.  The limitation is that texts may be added before and
after the original text, but nothing can be stripped.
>
> The mime body canonicalization is odd, what percentage of mail isn't
> multipart at this point?

Interesting.  I thought that it was quite sensible to reduce the
MIME-representation of binary content to its original binary form before
signing.

The idea of multipart, which I may need to describe more accurately, is
that the individual body parts get their own DKIM-Signature, and are
then combined (following the MIME hierarchy) to form the overall
DKIM-Signature for the message.  By including Content-ID and Message-ID
logically, it is possible to always distinguish between a body part and
the full email.

> It also requires that mime canonicalization requires impl of this
> other spec, which seems more v2ish, though maybe not in formatting.
>
There is no reason why one couldn't sign separately with relaxed/relaxed
and mime/mime canonicalisation.  This would allow for a transition
period.  Even with just mime/mime, there is currently no punishment (but
also no benefit) for mail parties that don't recognise the
DKIM-Signature due to the new canonicalisation algorithm.

The reason why this may not be so v2 as you suggest is that the
dependency on support by intermediate mail facilities is replaced by a
dependency on support in end points.  Note that I'm not talking software
support, but actual deployment.

> I'd say the mime header canonicalization doesn't go far enough.

Thank you, I tend to agree.  I had considered character sets as mere
interpretations of binary content, but was unsure if they were actually
rewritten in passing.

> For example, most mailing lists that are i18n aware are going to
> decode rfc2047 subjects before adding a subject prefix, and then
> reencode, which may not be in the same charset, especially if the
> prefix and subject are in different languages.

A starting charset may be recognised of course, and the number of
original forms would be very limited and may be iterable?  This raises
concerns of potential loss of information due to incomplete / incorrect
rewrites, however.

> Another issue we see is in optional things, such as quoting in address
> and parameter headers.  Defining a canonical form for those headers
> might be necessary to accomplish what you want.

This would be specific to the MIME-headers, I suppose.  Yes, a canonical
form would then be a useful addition to the MIME header canonicalisation
prescription.  The list of headers in use there is limited, so it would
be some work describing it, but not impossible, as far as I can tell.
Think of a predefined order for parameters, always using quotes, only
escaping when it is required, never mentioning default values, that sort
of thing.
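
To make those rules concrete, here is a rough sketch operating on
parameters that have already been parsed.  Everything in it is an
assumption for illustration: the function name, alphabetical order as the
"predefined order", lowercasing of the value and parameter names, and a
caller-supplied table of defaults.

```python
def canonical_params(value: str, params: dict, defaults: dict) -> str:
    """Render a header value and its parameters in one fixed form."""
    parts = [value.strip().lower()]
    lowered = {name.lower(): val for name, val in params.items()}
    for name in sorted(lowered):              # predefined order: alphabetical
        val = lowered[name]
        if defaults.get(name) == val:
            continue                          # never mention default values
        # Inside a quoted-string only backslash and double quote need escaping.
        escaped = val.replace("\\", "\\\\").replace('"', '\\"')
        parts.append(f'{name}="{escaped}"')   # always use quotes
    return "; ".join(parts)
```

With this, a Content-Type of "Text/Plain" with parameters Charset=us-ascii
and format=flowed, and us-ascii declared as the default charset, comes out
the same however the sender happened to write it.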

> There's also smtputf8 downgrades, which would imply taking a utf8
> nonencoded subject and encoding it.  It's also useful if your system
> uses a non rfc822 format internally and only gateways to the internet.

These are indeed pesky corner cases :-S so thanks for pointing them out
right away.

I am tempted to think that a limited number of alternative
representations for the Subject are possible, and could be iterated over
to recognise the header properly.  Always mapping characters to the
largest set spanning the current use case could help; that would be
possible in the before and after case and lead to the same canonical
format.  [I may however be naive about i18n and especially what overlap
character sets have.  Please tell me if that is the case.]
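
One way to picture "the largest set spanning the current use case" is to
decode any RFC 2047 encoded words to Unicode and canonicalise from there.
The sketch below assumes Unicode as the spanning set, NFC normalisation
and UTF-8 as the byte form fed to the hash; the function name is
hypothetical, and unknown charsets or lossy rewrites are not handled.

```python
import unicodedata
from email.header import decode_header, make_header


def canonical_subject(raw: str) -> bytes:
    """Map a Subject value to one byte form, whatever its wire encoding."""
    # Decode RFC 2047 encoded words (if any) into a single Unicode string.
    text = str(make_header(decode_header(raw)))
    # Normalise so composed and decomposed characters compare equal.
    text = unicodedata.normalize("NFC", text)
    return text.encode("utf-8")
```

A subject sent as "=?iso-8859-1?q?caf=E9?=" and re-encoded by a list as
"=?utf-8?b?Y2Fmw6k=?=" then yields the same bytes before and after.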

> I think a proper mime canonicalization would be useful on it's own,
> we've talked about it before internally here at Google.

Thanks.

> I believe that we discussed something similar to this early on this
> list, but the challenge is in the details.

You certainly convinced me of that.

If you have anything to add to the former, then by all means let me
know.  It is extremely useful to know what I'm heading for when deciding
to get into this or not.

Also, if the list thinks this is not a fruitful endeavour, then I
welcome further critiques :)


Thanks,
 -Rick

_______________________________________________
dmarc mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/dmarc
