On Sat, Aug 20, 2022 at 10:07 AM Warren Kumari <war...@kumari.net> wrote:
> Brian Dickson recently reached out to one of the DNSOP chairs to raise
> some technical concerns related to the AliasMode functionality in
> draft-ietf-dnsop-svcb-https.
>
> Although this document has already passed WGLC, IETF LC, IESG Eval, and
> was approved and sent to the RFC Editor, I want to make sure that the DNSOP
> working group has a chance to discuss any lingering concerns. Accordingly,
> I have asked the RFC Editor to hold publication for now (note that the hold
> itself is not expected to delay publication of the document, which is
> blocked anyway due to missing references).
>
> As the document was already extensively discussed and approved, we should
> only make substantive changes if they are very clearly warranted (e.g.
> something that would otherwise be an errata, or "OMG! That clearly doesn't
> work, 1+1 doesn't equal 17…"); this is *not* an opportunity to
> re-litigate existing decisions, make non-required changes, etc.
>
> I believe that Brian is on vacation this week, and I wasn't really able to
> parse his issue with the document, so I ask him to clearly state the issue
> on-list when he returns. I would like to have whatever discussions wrapped
> up within 2 weeks from then so that I can release it back to the RFC
> Editor.
>
> Pausing publication is an unusual, but definitely not unprecedented, step.
> Although we are able to make changes until a document is published as an
> RFC, once it is approved and sent to the RFC Editor, we should only make
> (non-editorial) changes in exceptional circumstances…
>
> I'd like to also thank the authors and WG in advance for their time and
> for keeping this discussion focused,
> W

Thank you Warren.

I'll try to first raise the highest-level concern, which is that some elements appear to have some level of ambiguity, resulting in implementations doing different things.
The place where these ambiguities exist is on the client side of things, meaning the procedures followed by clients, including how to interpret DNS responses that originate from authoritative DNS servers (either directly, or via resolvers, or via stub libraries).

To be clear: the wire format parts are fine, and what a zone administrator should publish is not in any way impacted. The differences in interpretation, and the client behavior under one of those interpretations, are the problem.

The easiest way forward, I think, is to try to add enough clarification to have a discussion about which interpretation has consensus. IMNSHO, the draft needs to reach the point where only one interpretation is possible, so that all implementations are in agreement, at least in the fundamental aspect of how clients should behave.

Once there is some clarification on the proposed text (e.g. with two alternative approaches equally clearly described), then the conversation can progress to "which of these is what DNSOP wants to be published?"

So, having prefaced things this way, here are the specific elements that are apparently ambiguous. I'll summarize as much as I can, but I will also include a couple of emails from a thread between myself and the authors, chairs, and one implementer, included mostly to demonstrate that the interpretation in question is quite specific and well articulated, i.e. that I'm not possibly misinterpreting something someone said.

- The problem is whether/when/how the DNS queries are considered failures, and whether/when/how some sort of fall-back procedure is followed in those cases.
- This includes ambiguity over whether further DNS queries/responses are required, if HTTP connection failures occur with resolved TARGET values.
- The ONLY concern is whether an AliasMode record (particularly at the zone apex) is treated EXACTLY the same as a constrained CNAME (i.e. unconditional QNAME rewrite if the RRTYPE is appropriate).
  - Unconditional would imply that an HTTPS-aware (or SVCB-aware, if you prefer) client never backtracks to the origin name to look up A/AAAA records for use. More precisely, if the client does look up the A/AAAA records speculatively and then gets an AliasMode record, it does not use those A/AAAA records under any conditions.
  - Conditional would imply that there are conditions under which the client MIGHT use sibling A/AAAA records instead of a valid AliasMode record, even if the AliasMode record was cryptographically protected and did not have a Chain-Length error. This situation, even if only "under certain circumstances", is the ANAME behavior.

Here is a longer description of where the problems/ambiguities appear to exist, which should be clarified first, and then discussed to decide what to do about them.

There are some phrases or terms that are not defined, or inconsistently used, or less than comprehensively enumerated:

- In section 3, the term "SVCB-optional" specifically only refers to "ServiceMode Records" (FIXME qv section 3.1 and 10.1 exceptions)
- The enumerated steps use the phrase "SVCB resolution has failed".
- "whether successful or not" plus appending the final $QNAME without SvcParams, is followed by reference to "falling back to non-SVCB connection modes".
  - A "connection mode" is an HTTP(S) thing, but this does not specify the DNS component of whatever is intended by "falling back".
  - This is immediately followed in the parenthesized text by ensuring that SVCB-optional clients will make use of an AliasMode record.
- Two paragraphs later, we have: "If the client is SVCB-optional, and connecting using this list of endpoints has failed, the client now attempts to use non-SVCB connection modes."
  - This is not consistent with the use of AliasMode records vs CNAME records, meaning that a CNAME and an AliasMode record, as alternative methods of delegating authority, would behave differently.
  - I.e. this behavior conflicts with the stated intent and behavior from Introduction (1.), Goals (1.1), and AliasMode (2.4.2).
- Also, the AliasMode section (2.4.2) has some text that conflates multiple issues, which appears to be one potential source of one of the major problems (ANAME behavior):
  - "As legacy clients will not know to use this record, service operators will likely need to retain fallback AAAA and A records alongside this SVCB record, although in a common case the target of the SVCB record might offer better performance, and therefore would be preferable for clients implementing this specification to use."
  - The conflation is between "legacy clients", and "preferable for clients implementing the specification".
  - Legacy support is not "fallback", which is where the conflation is introduced.
  - Non-legacy clients (which might better be described as SVCB-aware) should NOT be using records intended (by the zone administrator) for legacy-only usage.
  - Non-legacy clients using legacy-only records (A/AAAA records with the same owner name as an AliasMode SVCB record) is what causes the ANAME behavior to occur.
  - ANAME was soundly rejected by DNSOP. Introducing ANAME-like behavior is a major problem.
  - This behavior is introduced implicitly, rather than explicitly.
  - Having this documented explicitly is essential to resolving the client behavior ambiguity.

The Client behavior section (3.):

- In section 3.1 (Handling resolution failures), there are some partial enumerations that leave unstated what the alternative situation requires.
  - "If DNS responses are not cryptographically protected, clients MAY treat SVCB resolution failure as fatal or non-fatal".
    - What if DNS responses ARE cryptographically protected? And does that differ between protection mechanisms (DNSSEC vs encrypted transport)?
  - The first sentence (regarding cryptographically protected responses) only partially enumerates the cases, i.e. specific sources of resolution failure.
    - Explicit declaration of NXDOMAIN as being either a resolution failure, or not a resolution failure, would clarify this considerably (and IMNSHO, it should be a non-failure).
    - Similarly, NOERROR/NODATA response handling should be described to avoid ambiguity.
  - The case consisting of a single AliasMode record (without CNAME, which MUST be handled per 2.4.2 "This limit MUST NOT be zero, i.e. implementations MUST be able to follow at least one AliasMode record."), which is cryptographically protected, and which does not have any of the enumerated resolution failures, appears to not be covered under the category of "resolution failure". It also appears to not be covered by the "MAY treat as non-fatal" clause.
    - If the TARGET of a single AliasMode is unreachable, or is NXDOMAIN, or has NOERROR/NODATA for A/AAAA record queries, how should this be handled?
    - *This appears to be the specific place* where the ambiguity can result in ANAME-like behavior, and where implementations may diverge in behavior.

Purposes of A/AAAA records at the apex, compared:

- There are multiple possible reasons for inclusion of A/AAAA records at a zone apex:
  - Serving HTTPS-enabled zones to legacy clients, when a CDN serving the domain has stable A/AAAA addresses
  - Alerting legacy clients that they are not supported, using error pages specific to the client (e.g. User-Agent based response pages)
  - Non-WWW services available via IP address (SMTP, SSH, FTP, etc)
    - If no WWW services are present at such IP addresses, client connection attempts could negatively impact other services.
  - Multiple SVCB-compatible RR types may be present at a zone apex
    - Each such SVCB-compatible record type could have equally-legitimate fall-back address requirements
- The current specification for HTTPS effectively forecloses any other use of apex A/AAAA records, if the interpretation of "fall-back" is to regress all the way to the origin name and use the A/AAAA records at the zone apex
- Brittle A/AAAA addressing (fast flux addresses with low TTLs) is incompatible with use of apex A/AAAA records for HTTPS-aware clients
- Everything bad about ANAME would be incorporated if apex A/AAAA addresses are required to be applicable to HTTPS-aware clients
- The existence of an HTTPS AliasMode record at a zone apex SHOULD cause an HTTPS-aware client to never use the A/AAAA records at a zone apex, even if the SVCB process fails or the client is unable to connect to the service over any SVCB ServiceMode end-points or the IP address(es) of the final $QNAME.

Here is a summary of what I would like to have happen, personally:

1. I am strongly in favor of clarifying these issues, and once any problems are resolved, quickly moving to publication.
2. The HTTPS AliasMode record is something we want to start using as soon as possible, e.g. as soon as the majority of browser vendors have implemented the correct client behavior in a major release that is widely adopted (we may be near that point modulo the Chrome issue).
3. The client implementations appear to have complexity introduced when the "fall-back" logic is required, which goes away if/when the "fall-back" process is removed. This is likely a gating factor in at least one browser deploying the support for AliasMode.
4. We currently only care about AliasMode, from the perspective of authoritative DNS zones operated by us. We have implemented it and deployed it.
5. The main issue is not the interoperable wire format stuff, or the implementable state of the specification. Those are fine, and we have implemented and deployed HTTPS AliasMode support already.
6. The main issue is usability of AliasMode records -- putting HTTPS records at the zone apex (for lots of zones we manage). The "always follow the AliasMode without fallback" behavior for HTTPS-aware clients is the key requirement.
7. Resurrecting ANAME behavior in corner cases is every bit as bad as choosing ANAME as the standard.
8. The fallback that appears to have ended up in the draft (in some corner cases) is really unfortunate, and effectively useless. If the HTTPS AliasMode record results in unreachable web sites, that's an entirely acceptable outcome in all cases. Fallback at best would partially mask the problems, making identification and correction more difficult, while also resulting in poor user experiences.
9. In the worst case, broken Targets for AliasMode records have the ability to cause problems for any legacy-only resources, including potential financial impacts (via consumption of unbudgeted resources), again, for no real benefit.

Brian Dickson

Included quoted texts: email from me to authors; response by Ben (one of the authors).

> Hi, everyone,
>
> I have been working through some implementation challenges in interpreting
> the prescribed behavior in the current draft.
>
> (I'm with GoDaddy's DNS team, and have been working with the Google Chrome
> folks on handling of AliasMode records.)
>
> The TL;DR: is that there is some ambiguity that needs to be cleared up.
>
> I'm hoping these issues can be cleared up with some additional text.
> The exact wording isn't crucial, so much as that the client resolution
> process can be made unambiguous.
>
> (There is an implicit familiarity expectation with core DNS specs 1033,
> 1034, and 1035, where those specs themselves are somewhat lacking, and
> outside of the DNS industry, not many folks have the necessary experience
> to work around those issues.)
>
> The main issues are as follows:
>
>    - Clarification on NXDOMAIN aka Rcode==3, as relates to Section 3.1.
>       - NXDOMAIN should be explicitly included in 3.1 as "not a
>       resolution failure per se".
>       - AliasMode Targets with NXDOMAIN resolution results MUST be
>       handled the same as CNAME resolution with NXDOMAIN.
>    - Clarification on the overall handling of AliasMode records.
>       - I believe the intent is that AliasMode records should ALWAYS be
>       followed, even if the ultimate disposition of ServiceMode lookups
>       fails, in that there should not be any backtracking to before any
>       AliasMode lookups.
>       - In other words, there may need to be an extra terminology entry
>       for QNAME that means "QNAME after following all CNAME and AliasMode
>       records"
>       - The resolution steps at the start of Section 3 might need to be
>       cleaned up to distinguish AliasMode and ServiceMode related lookup
>       results.
>    - Inclusion of specification of what is meant by "fall back to
>    non-SVCB connection modes".
>       - This is referenced in a few places, but is not defined or
>       specified.
>       - IMNSHO, this connection mode should be declared as "use that last
>       QNAME from following AliasMode redirects (and CNAME redirects), and
>       make the connection using no SvcParams and using only A/AAAA records
>       resolved by looking up the last QNAME".
>       - Is it possibly the case that the parenthetical portion of the
>       fourth-last paragraph in section 3 is intended to DEFINE the
>       fall-back mode, rather than the last thing to try before falling
>       back to the currently undefined fall-back mode?
>    - Clarification on use of SVCB record not existing, in 3.1
>       - I think the intent here is actually "SVCB ServiceMode record",
>       and to treat the result as if the "SVCB ServiceMode record did not
>       exist", and to use the name of any redirections from CNAME and
>       AliasMode records as the service endpoint to use.
>    - Clarification on soft vs hard failures
>       - The "fatal vs non-fatal" should apply only to ServiceMode records
>       - NXDOMAIN results on AliasMode lookups MUST be treated as hard
>       failures
>       - Any other resolution failures on AliasMode records MUST be
>       treated as hard failures
>    - NXDOMAIN handling of parallel queries
>       - When there are parallel queries for (SVCB or HTTPS) records along
>       with A and AAAA records, an NXDOMAIN response for any of them MUST be
>       treated as an NXDOMAIN result for all of them. (This is a tautology,
>       BTW.)
>       - It may be worth adding words to that effect, so that implementers
>       can avoid delays waiting for now-moot queries. This would allow
>       faster progression to alternative ServiceMode records, and/or
>       terminating all queries (if no path forward exists at an AliasMode
>       record).
>
> Sorry for the late timing of this.
> We (authoritative DNS implementers) considered the spec not terribly clear
> but at least unambiguous.
> It was only when communicating with browser vendors doing implementation
> of the client side (Google Chrome in particular) that the issues surfaced.
>
> I think we all want this to be consistently implemented, and to be
> consistent with the authors' intents.
>
> Please correct me if my overall understanding (AliasMode == CNAME at apex,
> including obeying ALL of the behavior limits and RCODE results associated
> with CNAME) isn't correct.
>
> Thanks,
> Brian Dickson

Response from Ben:

[Apologies, my mail client wouldn't quote this correctly, so the rest of this message is Ben's response, not indented/quoted properly.]

On Sun, Jul 24, 2022 at 4:30 PM Brian Dickson <brian.peter.dick...@gmail.com> wrote:

> Hi, everyone,
>
> I have been working through some implementation challenges in interpreting
> the prescribed behavior in the current draft.
>
> (I'm with GoDaddy's DNS team, and have been working with the Google Chrome
> folks on handling of AliasMode records.)
>
> The TL;DR: is that there is some ambiguity that needs to be cleared up.
>
> I'm hoping these issues can be cleared up with some additional text.
> The exact wording isn't crucial, so much as that the client resolution
> process can be made unambiguous.
>
> (There is an implicit familiarity expectation with core DNS specs 1033,
> 1034, and 1035, where those specs themselves are somewhat lacking, and
> outside of the DNS industry, not many folks have the necessary experience
> to work around those issues.)
>
> The main issues are as follows:
>
>    - Clarification on NXDOMAIN aka Rcode==3, as relates to Section 3.1.
>       - NXDOMAIN should be explicitly included in 3.1 as "not a
>       resolution failure per se".
>

The current text is "... fails due to an authentication error, SERVFAIL response, transport error, or timeout". It seems to me that NXDOMAIN is clearly not on that list. Are you sure this needs clarification?

>
>       - AliasMode Targets with NXDOMAIN resolution results MUST be handled
>       the same as CNAME resolution with NXDOMAIN.
>

I don't think normative comparisons to CNAME are a good idea. CNAME and SVCB have conceptual parallels but work quite differently.

Also, I'm not sure this would be correct. The current text says

    If the client is SVCB-optional, and connecting using this list of
    endpoints has failed, the client now attempts to use non-SVCB
    connection modes.

In the event of an AliasMode record pointing to NXDOMAIN, I would expect SVCB-optional clients to retry with non-SVCB connection.

>
>    - Clarification on the overall handling of AliasMode records.
>       - I believe the intent is that AliasMode records should ALWAYS be
>       followed, even if the ultimate disposition of ServiceMode lookups fail,
>       in
>       there should not be any backtracking to before any AliasMode lookups.
>

As noted above, I believe this would be a substantive change from the present specification.
>
>       - In other words, there may need to be an extra terminology entry for
>       QNAME that means "QNAME after following all CNAME and AliasMode records"
>       - The resolution steps at the start of Section 3 might need to be
>       cleaned up to distinguish AliasMode and ServiceMode related lookup
>       results.
>    - Inclusion of specification of what is meant by "fall back to
>    non-SVCB connection modes".
>       - This is referenced in a few places, but is not defined or
>       specified.
>

I can't think of a clearer formal way to say "connect however you would have connected if this specification did not exist".

>
>       - IMNSHO, this connection mode should be declared as "use that last
>       QNAME from following AliasMode redirects (and CNAME redirects), and make
>       the connection using no SvcParams and using only A/AAAA records resolved by
>       the looking up the last QNAME.
>

This is addressed in the draft, and it is not "non-SVCB connection establishment":

>
>       - Is it possibly the case that the parenthetical portion of the
>       fourth-last paragraph in section 3 is intended to DEFINE the fall-back
>       mode, rather than the last thing to try before falling back to the
>       currently undefined fall-back mode?
>

For posterity, that text is:

    SVCB-optional clients SHALL append to the priority list an endpoint
    consisting of the final value of $QNAME, the authority endpoint's
    port number, and no SvcParams. (This endpoint will be attempted
    before falling back to non-SVCB connection modes. This ensures that
    SVCB-optional clients will make use of an AliasMode record whose
    TargetName has A and/or AAAA records but no SVCB records.)

This is not considered "non-SVCB connection establishment" because SVCB has still influenced the QNAME.
>
>    - Clarification on use of SVCB record not existing, in 3.1
>       - I think the intent here is actually "SVCB ServiceMode record",
>       and to treat the result as if the "SVCB ServiceMode record did not
>       exist",
>       and to use the name of any redirections from CNAME and AliasMode
>       records as
>       the service endpoint to use.
>

For posterity, the text is:

    If the client is unable to complete SVCB resolution due to its chain
    length limit, the client MUST fall back to the authority endpoint,
    as if the origin's SVCB record did not exist.

The intent here is indeed to fall back all the way to the authority endpoint. If clients would only fall back to some intermediate point in the alias chain based on their length limit, operators would become obligated to offer the service from every intermediate name in the chain. By falling back all the way to the authority endpoint, we ensure that operators are only required to offer service at the authority endpoint (i.e. non-SVCB connection) and the actual SVCB service endpoints. This is essentially parallel to CNAME: service operators are not obligated to offer the service at each step of a CNAME chain.

>
>    - Clarification on soft vs hard failures
>       - The "fatal vs non-fatal" should apply only to ServiceMode records
>

For posterity, the text is:

    If DNS responses are not cryptographically protected, clients MAY
    treat SVCB resolution failure as fatal or nonfatal.

I'm not sure what you're saying here. When a SVCB DNS query fails, the client doesn't know whether that query would have returned an AliasMode or a ServiceMode record. Regardless, the point of this line is merely to reiterate that the downgrade protections considered by this section are largely moot if there is no security between the client and resolver.
>
>       - NXDOMAIN results on AliasMode lookups MUST be treated as hard
>       failures
>
>       - Any other resolution failures on AliasMode records MUST be treated
>       as hard failures
>

As I think is clear from Section 3, failure to resolve an AliasMode TargetName is indeed a hard failure for SVCB-required clients, but SVCB-optional clients can tolerate this by abandoning SVCB entirely.

We could change this, by declaring that SVCB-optional clients MUST disable their fallback in this case, but I see no advantage to this. These clients would still have the fallback logic for other cases, and excluding this case seems like more work than including it, for client implementors. For operators, excluding this fallback increases operational fragility in the event of error, and conveys no obvious benefit.

>
>    - NXDOMAIN handling of parallel queries
>       - When there are parallel queries for (SVCB or HTTPS) records along
>       with A and AAAA records, an NXDOMAIN response for any of them MUST be
>       treated as an NXDOMAIN result for all of them. (This is a tautology,
>       BTW.)
>       - It may be worth adding words to that effect, so that implementers
>       can avoid delays waiting for now-moot queries. This would allow faster
>       progression to alternative ServiceMode records, and/or terminating all
>       queries (if no path forward exists at an AliasMode record).
>

I'm not aware of any such rule in Happy Eyeballs, which is the basis for this kind of parallel querying. Diverging from the Happy Eyeballs rules (or overspecifying the behavior here to conflict with Happy Eyeballs) would prevent client implementors from reusing their Happy Eyeballs implementation.

This optimization is an interesting observation, but it seems clear to me that a NXDOMAIN TargetName is always an operator error, so I don't think it is worth defining performance optimizations for that case.

> Sorry for the late timing of this.
> We (authoritative DNS implementers) considered the spec not terribly clear
> but at least unambiguous.
> It was only when communicating with browser vendors doing implementation
> of the client side (Google Chrome in particular) that the issues surfaced.
>
> I think we all want this to be consistently implemented, and to be
> consistent with the authors' intents.
>
> Please correct me if my overall understanding (AliasMode == CNAME at apex,
> including obeying ALL of the behavior limits and RCODE results associated
> with CNAME) isn't correct.
>

It's certainly not identical to CNAME in general, as it only applies to a single "scheme" on the hostname, not the entire hostname. However, it is a fairly close parallel, including the failure modes, in the SVCB-required case. For now, SVCB-optional is the common case, because most protocols predate SVCB, and here there is a substantial difference because both the beginning and the end(s) of the chain are considered valid endpoints.
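
The two readings of AliasMode handling discussed in this thread can be contrasted with a small sketch. This is only a toy model under stated assumptions, not the draft's algorithm and not any browser's implementation: the `Zone` class, the `HTTPS-ALIAS` pseudo-type, the `candidate_addresses` function and the `allow_origin_fallback` flag are all hypothetical names, only a single AliasMode hop is followed, and ServiceMode records, CNAMEs, DNSSEC and chain-length limits are ignored. An empty lookup result stands in for both NXDOMAIN and NOERROR/NODATA.

```python
from dataclasses import dataclass, field


@dataclass
class Zone:
    """Toy in-memory DNS data (hypothetical): (name, rrtype) -> list of values."""
    records: dict = field(default_factory=dict)

    def lookup(self, name, rrtype):
        # An empty list models NXDOMAIN or NOERROR/NODATA alike.
        return self.records.get((name, rrtype), [])


def candidate_addresses(zone, origin, allow_origin_fallback):
    """Return the ordered (name, address) pairs a client would try.

    allow_origin_fallback=False models the "unconditional" reading: once an
    AliasMode record is seen, A/AAAA records at the origin name are never used.
    allow_origin_fallback=True models the "conditional" reading: the origin's
    own A/AAAA records are tried last, as a non-SVCB connection.
    """
    candidates = []
    alias_targets = zone.lookup(origin, "HTTPS-ALIAS")
    if alias_targets:
        target = alias_targets[0]  # assumption: follow a single alias hop
        for rrtype in ("AAAA", "A"):
            for addr in zone.lookup(target, rrtype):
                candidates.append((target, addr))
        if not allow_origin_fallback:
            return candidates
    # No alias at all, or fallback permitted: use the origin's own addresses.
    for rrtype in ("AAAA", "A"):
        for addr in zone.lookup(origin, rrtype):
            candidates.append((origin, addr))
    return candidates


# Apex with an alias to a CDN name plus "legacy" A record at the apex.
healthy = Zone({
    ("example.com", "HTTPS-ALIAS"): ["cdn.example.net"],
    ("example.com", "A"): ["192.0.2.1"],
    ("cdn.example.net", "A"): ["198.51.100.7"],
})
# Same apex, but the alias target does not resolve to any address.
broken = Zone({
    ("example.com", "HTTPS-ALIAS"): ["gone.example.net"],
    ("example.com", "A"): ["192.0.2.1"],
})

for label, zone in (("healthy", healthy), ("broken", broken)):
    for fallback in (False, True):
        print(label, fallback, candidate_addresses(zone, "example.com", fallback))
```

In this model the two readings differ only when connecting via the alias target does not succeed: with the broken target, the unconditional reading yields no candidates at all, while the conditional reading sends the SVCB-aware client to the apex address that the operator may have intended for legacy clients only.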
_______________________________________________ DNSOP mailing list DNSOP@ietf.org https://www.ietf.org/mailman/listinfo/dnsop