Hi Andy,
Thanks again for engaging in detail. The text suggestions are also appreciated!
The changes from this iteration are summarized in
https://github.com/desec-io/draft-ietf-dnsop-ds-automation/commit/d9fed34480c571be3b49830c966dba264078090f.
As before, the rendered diff from the last revision is visible at
https://author-tools.ietf.org/iddiff?url1=draft-ietf-dnsop-ds-automation-08&url2=https://desec-io.github.io/draft-ietf-dnsop-ds-automation/draft-ietf-dnsop-ds-automation.txt.
For detailed responses, please see below.
On 5/21/26 00:26, Andy Newton wrote:
#### 5-15 Minutes
240 2. Parent-side entities (such as registries) SHOULD reduce a DS
241 record set's TTL to a value between 5–15 minutes when a new set
242 of records is published, and restore the normal TTL value at a
243 later occasion (but not before the previous DS RRset's TTL has
244 expired).
Why isn't this a MUST? It is a SHOULD for both behaviors (the 5-15 minute TTL)
and the restoration of the TTL. At the very least, it seems the restoration of
the "normal" TTL value is a MUST. However, why isn't the 5-15 minute value a
MUST? Otherwise an operator can set it to 5 years justified with "because I
just wanted it that way". If the SHOULD is to remain, isn't there a reasonable
upper boundary behavior such as not being greater than the normal TTL value?
Three thoughts:
- TTLs accepted by resolvers are bounded in practice. Most do not accept
arbitrarily long TTLs, as that has other security issues such as retaining a
hijacked delegation beyond its actual removal. I can collect numbers if needed,
but my recollection is that TTLs over 1-2d are not honored in practice.
- The above resolver-side constraint aside, parents can set whatever TTLs they
want, even without DS automation. DS automation does not change that. The point
of this provision in the document is to mitigate the impact of a botched DS
update, not to give general recommendations about what TTLs should be used in a
TLD registry. (In practice, TTLs in TLD zones vary between 1h and 48h.)
- It's not a MUST because there is no problem if the registry uses a TTL of 20
or 30 minutes permanently, for example. Also, when DS automation is performed
by the registrar, the registrar may not have the ability to enforce TTL
adjustment. -- Now one could say that the document should take the stance that
DS automation is to be done by registries and not registrars. However, getting
into that discussion was very clearly rejected by many participants: the
registries and registrars want to sort out these responsibilities between
themselves. And indeed, it's a business decision, not primarily a technical one.
If there are no interoperability issues with the registry setting 30 minutes as
the TTL value, which is outside the 5 to 15 minute suggestion, then this is not
a normative requirement and the SHOULD is to be lower-cased and/or changed to
use non-BCP14 language.
Taking the DISCUSS seriously, please excuse me for mounting a discussion ;-) I
will lay out my understanding but eventually defer as I don't have the
necessary context on which documents may (not) make which particular use of
these key words across IETF areas. Other ADs may know better than I do, so I'd
prefer to eventually defer. I appreciate the learning opportunity for when I'll
be serving as an editor in the future!
First, RFC 2119 Section 6 (and the IESG statement [1]) mandates using uppercase key
words only when "actually required for interoperation or to limit behavior which has
potential for causing harm". The RFC itself details:
... potential for causing harm (e.g., limiting retransmisssions) For
example, they must not be used to try to impose a particular method
on implementors where the method is not required for
interoperability.
In this context, it seems to me that your view (~"30 minutes TTL don't cause an
interoperability issue, so uppercase is inappropriate") is not compelling. For
example, a higher TTL does have a slight risk of causing retransmissions, namely when the
new DS record set is botched and causes SERVFAILs on lookups, which are not cacheable and
will cause look-up retries. When the DS TTL is set to 5-15 minutes instead of 30, there
will be fewer retries, as a correction to the DS RRset will be visible more quickly. The
DNSOP WG has determined that 5-15 is the best, albeit not the only, choice to address this
risk. Yet, there is no interoperability issue when deviating a bit.
The IESG statement [1] also has a section on "Common Additional Uses of Key Words" which
endorses use of uppercase key words for operational requirements, and the 5-15 minute recommendation
arguably is an operational (soft) requirement and not just an opinion (that is, "it's best
done this way").
I will defer to your expertise regarding this being an operational requirement,
but it is not clear to me that it is from the description. If it is an
operational requirement, then why can't it be a MUST? If it is to be a SHOULD
then what are the ramifications of violating the SHOULD? Let me offer some text:
Parent-side entities (such as registries) SHOULD reduce a DS
record set's TTL to a value between 5–15 minutes when a new set
of records is published. Using values below 5 minutes risks
excessive queries, and using values greater than 15 minutes may impact
recovery from operational mistakes. The parent-side SHOULD restore
the previous (or, if unavailable, a default) TTL value at a later
occasion (but not before the previous DS RRset's TTL has expired)
to prevent operational issues arising within the client-side processes.
Hopefully I got the technical consequences correct.
You got them perfectly correct :-) And I like the words you added.
The document's structure is that all recommendations are given in sections *.1
(and consolidated in Appendix A), whereas considerations around the why and
what-if-not are all in sections *.2. The currently proposed text conflates
that, for a single recommendation only.
I've thus taken your words and put them in the corresponding analysis section
4.2:
OLD
Registries therefore should significantly lower the DS RRset's TTL
for some time following bootstrapping or an update. Pragmatic values
for the reduced TTL value range between 5–15 minutes. Such low TTLs
might be expected to cause increased load on the corresponding
authoritative nameservers; however, recent measurements have
demonstrated them to have negligible impact on the overall load of a
registry's authoritative nameserver infrastructure [LowTTL].
NEW
Registries therefore should significantly lower the DS RRset's TTL
for some time following bootstrapping or an update. Pragmatic values
for the reduced TTL value range between 5–15 minutes. Using values
below 5 minutes risks excessive queries, and using values greater
than 15 minutes may impact recovery from operational mistakes.
Note that recent measurements have demonstrated low TTLs like the
above to have negligible impact on the overall load of a registry's
authoritative nameserver infrastructure [LowTTL].
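
To illustrate the timing rule in recommendation 2 (lower the TTL on publication, and restore it no earlier than the expiry of the previous DS RRset's TTL), here is a minimal Python sketch. It is not taken from the draft: the function name `ds_ttl`, its parameters, and the 10-minute default are hypothetical, and capping the result at the normal TTL is my own assumption for registries whose normal TTL is already short.

```python
# Hypothetical sketch of the DS TTL rule discussed above; not from the draft.
REDUCED_TTL_MIN = 5 * 60   # 5 minutes, lower bound of the suggested range
REDUCED_TTL_MAX = 15 * 60  # 15 minutes, upper bound of the suggested range


def ds_ttl(now, updated_at, previous_ttl, normal_ttl, reduced_ttl=600):
    """Return the TTL (seconds) to publish the DS RRset with at time `now`.

    now, updated_at: timestamps in seconds; updated_at is when the new
        DS RRset was published.
    previous_ttl: TTL the previous DS RRset was served with.
    normal_ttl: the registry's regular DS TTL.
    reduced_ttl: temporary TTL, expected to be within 5-15 minutes.
    """
    if not REDUCED_TTL_MIN <= reduced_ttl <= REDUCED_TTL_MAX:
        raise ValueError("reduced TTL outside the 5-15 minute range")
    # Assumption: never publish a TTL above the normal one.
    reduced = min(reduced_ttl, normal_ttl)
    # Caches may still hold the previous RRset until its TTL has expired,
    # so keep the reduced TTL at least that long.
    if now - updated_at < previous_ttl:
        return reduced
    return normal_ttl
```

With a previous TTL of one hour and a normal TTL of one day, `ds_ttl(now=100, updated_at=0, previous_ttl=3600, normal_ttl=86400)` stays at the reduced value, and the normal TTL only returns once a full hour has passed since the update.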
To reiterate, this is my understanding, but I may not have the full picture
across IETF areas.
[1]:
https://datatracker.ietf.org/doc/statement-iesg-statement-on-clarifying-the-use-of-bcp-14-key-words/
Our discussion made me reconsider an earlier change I had confirmed in
https://mailarchive.ietf.org/arch/msg/dnsop/bO0CX4FPyS907p0HZLF5Xilxbdo/,
namely that the reporting recommendations in Section 5.1 should be lowercased.
That was based on the insight that, as you pointed out, reporting is not an
interoperability requirement. However, the "Common Additional Uses of Key
Words" section of [1] explicitly lists:
- Operational requirements, especially around mandatory logging
and configuration needed to produce successful deployments
It seems to me that reporting is pretty much the same as logging, except to a
different party (namely, the child operator, the registrant and the like). The
guidance in the IESG's statement is not limited to actual logging, it only
names it especially, yet as an example. Section 5.1 thus appears appropriately
covered by this clause. As a result, I've reverted this change.
As above, I'm not objecting to making any changes if I'm in the rough; I just
don't know whether I am.
If your opinion is that a successful deployment requires reporting then it does
sound like an operational requirement. Then the question is MUST or SHOULD. If
SHOULD, then can the consequences of not following that SHOULD be described?
Perhaps the following text before the numbered entries in 5.1 helps: "One or more of
the reporting methods described below MUST be implemented."
The different reporting methods address different aspects, so it's not like
implementing one mitigates the adverse effects of not implementing another.
Rather, they are all very useful for debugging purposes, to prevent confusion
etc. (like logging, so it's covered by BCP 14), but they are not crucial in
order to make DS automation work when everything runs smoothly. It's thus very
reasonable to implement these, but MUST seems too much.
We could add that changes and/or errors may go unnoticed if reporting is skipped, and
then whatever consequence is imaginable from that may ensue. However, that's quite
obvious, and it doesn't help the reader to add text that's not meaningful. The analysis
section on reporting (5.2) contains a lot of reasoning about the topic, and you may
regard that whole section as "qualifying the SHOULD".
Note that the IESG statement on BCP 14 does not request that "SHOULD exceptions & consequences" be
explained immediately adjacent to where the key word is used. It only points out that providing "readers
with all of the details they need to make an informed decision" is more "valuable" than not doing
so. I really think that the interplay of sections 5.1 and 5.2 (mirroring the general structure of the document)
is suitable for that purpose.
#### Both CDNSKEY and CDS
246 3. DNS operators SHOULD publish both CDNSKEY and CDS records, and
247 follow best practice for the choice of hash digest type
248 [DS-IANA].
Section 4.2.3 does a good job of explaining why both CDNSKEY and CDS are
needed, so it seems the justification here is a MUST. In other words, if you
want to interoperate, the operator MUST do this otherwise there is likely to be
a problem.
There is no problem if the child operator knows which type of update format
(CDS or CDNSKEY) the parent consumes. There would only be a problem if the
parent insisted on both being published. Indeed, that provision was part of the
draft, but was removed in -01 due to the argument that it's not a good idea to
reject an update just because a redundant format is missing.
A written record of this argument can be found at
https://mailarchive.ietf.org/arch/msg/dnsop/ObpPwt5_HrmsPXE3dG8CrJUP9mg/; more
feedback in that direction was gathered during an in-person workshop at DNS
OARC 45.
Given these points, it seems not generally required for interoperability to
publish both CDS and CDNSKEY, although if you do so you can be agnostic about
the parent's preference and avoid certain risks. But, if you know what you're
doing and what the parent expects, the risk goes away. Hence the SHOULD.
See also
https://mailarchive.ietf.org/arch/msg/dnsop/Uv-oyqj-gp1dlfPIofct22Ijqo0/.
Why can't the guidance be more explicit? "Unless the child and parent have agreed
upon using either CDS or CDNSKEY, DNS operators MUST publish both CDNSKEY and CDS."
Such agreement is not to be expected in practice, as domain registries generally do not
have any sort of agreement with DNS child operators. There may be rare exceptions (which
"SHOULD" would accommodate), but the registry's preferred ingestion format is
normally published in their DNSSEC Practice Statement.
Additionally, such agreements are something to actually advise against, because they
ossify expectations and prevent the registry from changing their preferred format later. If
either CDS or CDNSKEY is eventually deprecated (and removing this dichotomy would be a
good thing), at least some registries would have to make that change, and agreements like
the above would stand in the way. (That's why my point earlier was on the child operator
"knowing" the preference, such as from a DNSSEC Practice Statement, not the
registry committing to it in an agreement.)
Is your objection that once an agreement is reached it will not change? Or that
it is hard to change?
Mainly the latter; there's always a long tail. We therefore should not
encourage them (and the associated ossification).
I guess the onus can be placed on the child entirely: "Unless the child has
knowledge of the parent's preference, both CDNSKEY and CDS MUST be published."
Brilliant! Sometimes solutions are so obvious when you see them. Reworded
slightly (so that the acting party is at the beginning of the sentence for
catchiness):
OLD
3. DNS operators SHOULD publish both CDNSKEY and CDS records, and
follow best practice for the choice of hash digest type
[DS-IANA].
NEW
3. DNS operators MUST publish both CDNSKEY and CDS records (unless
the parent's preference is known), and follow best practice for
the choice of hash digest type [DS-IANA].
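
Read from the child operator's side, the NEW wording boils down to a small decision. The following Python sketch is only an illustration: the function name `record_types_to_publish` and the `parent_preference` parameter are hypothetical, and how the preference becomes known (for example, from a DNSSEC Practice Statement) is outside its scope.

```python
# Hypothetical sketch of the publication rule in the NEW text above.
BOTH = frozenset({"CDS", "CDNSKEY"})


def record_types_to_publish(parent_preference=None):
    """Return the minimum set of record types the child operator publishes.

    parent_preference: "CDS", "CDNSKEY", or None if the parent's
        preference is not known.
    """
    if parent_preference is None:
        # Preference unknown: both types are required.
        return set(BOTH)
    if parent_preference not in BOTH:
        raise ValueError("unknown record type: %r" % (parent_preference,))
    # Preference known: the preferred type alone is sufficient.
    return {parent_preference}
```

Publishing both types remains permissible when the preference is known; the sketch returns only the minimum needed.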
----------------------------------------------------------------------
COMMENT:
----------------------------------------------------------------------
[...]
### RDAP TTLs
410 5. The currently active DS configuration SHOULD be made accessible
411 to the registrant (or their designated party) through the
412 customer portal available for domain management. The DS update
413 history MAY be made available in the same way.
https://datatracker.ietf.org/doc/draft-ietf-regext-rdap-ttl-extension/ is also
a good way to make the currently active TTL values available to the registrant
or their designated party.
I will make a note in Section 5.2. My impression is that publicly stating the
TTL policy would belong in DNSSEC Practice Statements, so I'd be reluctant to
include it as a recommendation.
I agree. That is only informative, as is all of item 5.
I beg to disagree: As mentioned above, uppercase key words are fine "to limit
behavior which has potential for causing harm". Operational harm is certainly
imaginable when a registrant can't inspect their current configuration after it has been
changed through DS automation, and the registrant wrongly assumes a different state while
operating their nameserver (I'll be happy to come up with some specific instances where
that could play a role).
I agree. Can that be put into the draft? "Failure to provide the registrant a means
to inspect their current configuration after it has been changed may cause the inability
of the registrant to recover from operational incidents because the registrant may have
out-of-date information."
Sounds good. I've made the following change:
OLD
The registrant (or their designated party) should be able to retrieve
the current DS configuration through the customer portal available
for domain management.
NEW
The registrant (or their designated party) should be able to retrieve
the current DS configuration through the customer portal available
for domain management. Failure to provide the registrant a means to
inspect the current configuration after it has been changed may
hinder recovery from operational incidents because the registrant may
have out-of-date information.
Best,
Peter