This isn't about terminology; it is the recurring debate about a registry's 
responsibility here.

It's simple to state a policy that says:

If a registered NS record does not function properly, the registrant will be 
notified and the NS record will be removed from the DNS until such time that it 
functions properly.

Nice, simple, clean.  Sounds like something a responsible registry would do.  
But it is the tip of an iceberg of issues.

Issue 1: define "function properly".  That can be done.  Lame, non-responsive, 
and so on.  But as I said privately to the original poster, the "science" of 
bad responses is vastly different from the "science" of no response.  (I recall 
from my experimentation that for some addresses, I could repeat the question 
over 10 times [some seconds apart], maybe 13, and still get back a "first" 
response from the address.  I used the id field to tell the queries apart.  To 
this day, I am astonished by that.)
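
For illustration, here is a minimal sketch of that bookkeeping: give each 
retry its own id, then count which id each arriving packet carries.  This is 
not the tooling I used; the function names are made up for the example, and 
only the header layout comes from RFC 1035.

```python
import struct


def build_query(qname: str, qid: int, qtype: int = 2) -> bytes:
    """Encode a minimal DNS query in RFC 1035 wire format (qtype 2 is NS)."""
    # Header: ID, flags (0 = standard query), QDCOUNT=1, AN/NS/AR counts = 0.
    header = struct.pack("!HHHHHH", qid, 0, 1, 0, 0, 0)
    # Each label is a length octet followed by the label bytes.
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in qname.rstrip(".").split(".")
        if label
    )
    # Root label terminator, then QTYPE and QCLASS (1 = IN).
    return header + labels + b"\x00" + struct.pack("!HH", qtype, 1)


def message_id(packet: bytes) -> int:
    """Return the 16-bit ID from the first two octets of a DNS message."""
    return struct.unpack("!H", packet[:2])[0]


def attribute_responses(sent_ids, packets):
    """Count, per query ID that was sent, how many packets carry that ID."""
    counts = {qid: 0 for qid in sent_ids}
    for packet in packets:
        qid = message_id(packet)
        if qid in counts:
            counts[qid] += 1
    return counts
```

If the count for the first id keeps growing while later ids stay at zero, the 
server is still answering the first question, which is the behavior described 
above.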

Issue 2: how is the registrant notified, and what constitutes "success" in 
notifying the registrant?  Is an email to the NOC contact enough?  A robo-call?  
What if the contact information is inaccurate?  These questions need answers 
in order to tell whether the registry is properly implementing its own policy.

Issue 3: determining the state of the service.  This is tougher than it seems.  
Multiple vantage points, sampling over time, setting a threshold for how many 
failed responses per time quantum constitute failure, yadda, yadda, yadda.  
Keep in mind, the NS record may be part of an anycast cloud and, if the 
registry is hitting one instance, that one might be affected by a spurious 
traffic flood.
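
To make the moving parts concrete, here is a rough sketch of one possible 
decision rule.  Everything in it is an assumption for illustration (the 
Probe record, the 300-second window, the 50% ratio, the two-vantage and 
two-window minimums); it is not any registry's actual procedure.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    vantage: str      # hypothetical label for where the probe was sent from
    timestamp: float  # seconds since an arbitrary epoch
    ok: bool          # True if the name server answered acceptably


def failing_windows(probes, window=300.0, max_failure_ratio=0.5):
    """Return the (vantage, window index) pairs whose failure ratio is too high."""
    totals = defaultdict(lambda: [0, 0])  # key -> [failures, total probes]
    for p in probes:
        key = (p.vantage, int(p.timestamp // window))
        totals[key][1] += 1
        if not p.ok:
            totals[key][0] += 1
    return {k for k, (bad, total) in totals.items() if bad / total > max_failure_ratio}


def is_failed(probes, window=300.0, max_failure_ratio=0.5,
              min_vantages=2, min_windows=2):
    """Declare failure only when enough vantage points each see enough bad windows."""
    bad_per_vantage = defaultdict(int)
    for vantage, _ in failing_windows(probes, window, max_failure_ratio):
        bad_per_vantage[vantage] += 1
    agreeing = sum(1 for n in bad_per_vantage.values() if n >= min_windows)
    return agreeing >= min_vantages
```

With these defaults, a single vantage point that sees nothing but failures 
(say, it is routed to one flooded anycast instance) is not enough to declare 
the NS record broken.  Every one of those numbers is a policy choice someone 
has to defend.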

My concern is the liability for false positives in failure testing.  I've been 
at the wrong end of such a test, where the registry had failures on their end 
and pointed the finger at us.  (IPv6 was the subject of the test.)  Even if the 
customer-impact of that was low, we spent a lot of resources poring over 
logs, contacting service providers, tracing the routes, only to find the error 
was a scripting error by the registry.  I traced that down by meeting the 
tester -in person- and going over the test results.

Issue 4: If the registry pulls the NS record, the operator can't test their 
changes until the registry re-tests.  This makes operating the registration 
harder: the tech doing the work has to either engage the registry's tech 
support "live" (language barrier included) or leave the ticket open until the 
registry gets around to the next test.

Issue 5: Even if the registry pulls the offending NS record, it might still be 
in the authoritative set, meaning caches will still have it present.  I.e., 
pulling the NS record at the parent is trumped by the child.  (This assumes 
some other NS is working, making the authoritative set visible.)
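
A toy model of why that happens, assuming a cache that follows the data 
ranking idea in RFC 2181 (section 5.4.1), where data from an authoritative 
answer outranks data learned from a referral.  The two-level ranking and the 
NSCache class are simplifications invented for this example; TTL expiry is 
ignored, and real resolvers differ in how child-centric they are.

```python
REFERRAL = 1       # NS set learned from the parent's delegation
AUTHORITATIVE = 2  # NS set learned from the child zone's own servers


class NSCache:
    """Toy cache keeping one NS set per zone, replaced only by equal or higher rank."""

    def __init__(self):
        self._entries = {}  # zone -> (rank, frozenset of server names)

    def offer(self, zone, rank, servers):
        """Store the NS set unless a higher-ranked one is cached; report acceptance."""
        current = self._entries.get(zone)
        if current is None or rank >= current[0]:
            self._entries[zone] = (rank, frozenset(servers))
            return True
        return False

    def lookup(self, zone):
        """Return the cached NS names for the zone, or an empty set."""
        entry = self._entries.get(zone)
        return set(entry[1]) if entry else set()


cache = NSCache()
# The resolver first follows the parent's referral, then hears from the child.
cache.offer("example.", REFERRAL, {"ns1.example.", "ns2.example."})
cache.offer("example.", AUTHORITATIVE, {"ns1.example.", "ns2.example."})
# The registry pulls ns2 at the parent; a later referral lists only ns1.
accepted = cache.offer("example.", REFERRAL, {"ns1.example."})
```

In this model the last offer is rejected, so ns2.example. stays in the cache 
for as long as the child's own NS set keeps listing it.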

Philosophically, in DNS, once a delegation is made, it's the child's.  For 
better or worse, the protocol doesn't equip the registry to "coach" the child 
well.  Any work done towards that is "fighting entropy".  It can be done, but 
consumes energy (instead of producing it).



_______________________________________________
DNSOP mailing list
DNSOP@ietf.org
https://www.ietf.org/mailman/listinfo/dnsop
