On 10/26/2017 11:11 AM, Warren Kumari wrote:
> Wes and I do believe that this is an important document - getting
> these timers wrong potentially has really bad security implications;
> So, pretty please, review this document and send feedback. We've tried
> hard to make it readable, but the topic is unfortunately complex and
> can only be simplified so far - it is also really hard to talk about
> sliding windows of time.
*sigh*
It's really not complex, and it's neither a timer nor an interval nor a
window, but a fixed point in time given the input data:
earliest date when it's safe to revoke ALL old trust anchor keys ::=
    latest expiration date of any RRSet containing the key(s) to be
    revoked + queryInterval (from 5011) + holdDownTime (from 5011) +
    queryInterval (from 5011)
earliest date when the zone owner thinks it's safe to revoke all the
old keys ::=
    the above + safetyFactor
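The two definitions above are plain date arithmetic. A minimal Python
sketch, assuming the caller already knows the 5011 queryInterval and
holdDownTime for the zone; the function names and the example values
are illustrative only and come from neither 5011 nor the draft:

```python
from datetime import datetime, timedelta, timezone


def earliest_safe_revoke(latest_expiration, query_interval, hold_down_time):
    """Earliest date when it's safe to revoke ALL old trust anchor keys."""
    # first query + hold-down + next query after the hold-down expires
    return latest_expiration + query_interval + hold_down_time + query_interval


def earliest_owner_revoke(latest_expiration, query_interval, hold_down_time,
                          safety_factor):
    """The same date, pushed out by the zone owner's safetyFactor."""
    return earliest_safe_revoke(latest_expiration, query_interval,
                                hold_down_time) + safety_factor


# Example values are assumptions chosen only to show the arithmetic.
expiry = datetime(2017, 11, 1, tzinfo=timezone.utc)
safe = earliest_safe_revoke(expiry, timedelta(days=1), timedelta(days=30))
print(safe.isoformat())
```

Note that the result is a date, not a duration: everything is anchored
to the expiration of the last RRSet containing the old key(s).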
The first queryInterval is the maximum time it takes for all resolvers
to make their first query (assuming no retries). The second
queryInterval is the time for all resolvers to make the "next query
after the hold down time expires" (again assuming no retries).
The safety factor is there primarily to deal with network outages AT THE
RESOLVER and is a SWAG that should represent a value that captures the
answer to the question "given the retry interval and a 99% network
uptime and N resolvers, how long until 99.99% of the resolvers have
completed all necessary retries at both the beginning and the end of the
process?" Like most retry questions, this has a bi-modal answer - given
a reliable network, the vast majority of resolvers will be successful in
small multiples of the retry interval. A tiny few will not be
successful even after hundreds of retries. The SWAG should pick a number of
retries that balances the operational need to complete the process with
the possibility of a few resolvers not getting the word.
so: safetyFactor ::= retryInterval (from 5011) * (5 + func(N))
The constant 5 is a SWAG to set a minimum for even a small group of
resolvers.
For func(N) I was thinking log2(N), which would give about 23 for 10
million resolvers, or 28*retryInterval in total. Or something suitably
non-linear....
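As a quick check of that arithmetic, here is a minimal Python sketch.
It assumes func(N) is log2(N) rounded down (the rounding is my reading
of the "23" above, not something stated), and the one-hour retry
interval is an assumed example value:

```python
import math
from datetime import timedelta


def safety_factor(retry_interval, n_resolvers, minimum_retries=5):
    """retryInterval * (5 + func(N)), with func(N) = floor(log2(N))."""
    # Assumption: round log2(N) down to a whole number of retries.
    retries = minimum_retries + math.floor(math.log2(n_resolvers))
    return retry_interval * retries


# 10 million resolvers, assumed retry interval of one hour.
print(safety_factor(timedelta(hours=1), 10_000_000))
```

Because the growth is logarithmic, going from 10 million to 20 million
resolvers adds only one more retry interval.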
Note that "latest expiration date of any RRSet containing the key(s) to
be revoked" makes a worst case assumption: that the pre-signed RRSets
are not necessarily protected from disclosure. If they are in fact
protected, then this reduces down to "the latest expiration date of any
RRSet containing the keys to be revoked that has been seen publicly". The
document SHOULD assume that any signed RRSet may be available to an
attacker whether it's been published in the DNS or not.
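The difference between the worst-case and the "seen publicly" reading
can be sketched in Python as follows. The SignedRRSet record, its
fields, and the key tags in the example are hypothetical and exist only
for this illustration:

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class SignedRRSet:  # hypothetical record type for this sketch
    expiration: datetime   # signature expiration of the RRSet
    key_tags: frozenset    # keys contained in the RRSet
    published: bool        # has it been seen publicly?


def latest_expiration(rrsets, keys_to_revoke, assume_disclosed=True):
    """Latest expiration of any RRSet containing a key to be revoked."""
    candidates = [
        r.expiration for r in rrsets
        # Worst case: count pre-signed RRSets even if never published.
        if r.key_tags & keys_to_revoke and (assume_disclosed or r.published)
    ]
    return max(candidates) if candidates else None


# Made-up data: one published RRSet and one pre-signed, unpublished one.
rrsets = [
    SignedRRSet(datetime(2017, 11, 1, tzinfo=timezone.utc),
                frozenset({1111, 2222}), True),
    SignedRRSet(datetime(2017, 12, 1, tzinfo=timezone.utc),
                frozenset({1111, 2222}), False),
]
print(latest_expiration(rrsets, frozenset({1111})))
```

With the default worst-case assumption the unpublished December RRSet
sets the date; with assume_disclosed=False only the published November
one counts.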
5011 was a timer based document because it applied to each resolver in
each resolver's time domain. When a timer fired on resolver A, it had
no impact on the behavior of resolver B nor on the DNS server that was
being queried.
With this document, the 5011 timer intervals are useful only to figure
out an earliest possible safe date given the previous live data set.
When that data set (or those data sets) becomes irrelevant for the
purposes of Wes' attack is pretty straightforward: when it expires! The
5011 intervals
can be used to calculate forward from that DATE and TIME to get an idea
of when most well-behaving resolvers will have accepted new trust
anchors even if they were being attacked. Note "most". There is no
guarantee that, even if you waited 6 years, all resolvers would get
the new trust anchors, and you just need to accept that the fallback is
for the resolver owner to fix the problem when it occurs.
This discussion has been helpful because it forced me to consider two
things the root does that I never contemplated in 5011: 1) A steady
state of a single trust anchor and 2) pre-signing RRSets. Neither of
these affects the resolver implementation, but both make the
publishing/signing schedules more security sensitive - hence this document.
Later, Mike
_______________________________________________
DNSOP mailing list
https://www.ietf.org/mailman/listinfo/dnsop