[DNSOP] Re: Character encoding in DNS

Petr Menšík Mon, 24 Nov 2025 16:41:03 -0800

On 22/11/2025 03:59, Andrew Sullivan wrote:

On Sat, Nov 22, 2025 at 02:31:41AM -0500, Petr Menšík wrote:
Yes, it is binary data. Any binary content is permitted. It only depends on how you choose to display it.
Then the requirement about UTF-8 is vacuous and should be removed from the document. The problem with that approach, of course, is that the interoperability argument for standardizing this at all is rather harmed. But since there's a semantics to the tags in a TXT RR implementing the specification, it's sort of hard to believe it's really just a matter of whatever the display wants to do.

I would not call it a requirement. I am not sure which document you are referring to. This is an idea without a formal draft. The fval of draft-davids-forsalereg is a good example: "v=FORSALE1;fval=€999" is good for people and should be printed in a human-friendly way, IMO.

How did we get from binary-only data to automatic processing and a normalization form of some kind? If it contains only printable characters in a verified encoding, it is reasonably safe not to escape each byte.
To the extent I understand this, I'm pretty sure I disagree. I don't think this is the right list for elementary education about text encoding; but you seem already to have "only printable characters" in your assumptions there, so perhaps I'll ask you this: Are ZWJ and ZWNJ printable or not? If you don't understand the question, don't know what the answer is, or don't know why that is a trick question then I suggest you can't wave away this set of problems. If instead you want to say that literally _any_ binary data is allowed, then say that. If you want to suggest something other than some PRECIS profile (I think it was in this thread that Paul Hoffman made a different suggestion), then do that. But the IETF has made a hash of internationalization over and over again precisely because of specification writers waving away the problems inherent in writing systems and their encodings online. Perhaps not ironically, one of the earliest examples of this is the DNS itself, which is "8 bit clean" except, of course, for that little part where some bits match other bits in some cases (this is case folding). The intermediate systems are supposed to cache the original form, but that doesn't always happen.

Python 3 says that neither ZWJ nor ZWNJ is printable. These definitions are corner cases. It is up to higher layers to render text. That is the job of GUI toolkits, terminal emulators and similar.
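
The Python claim can be checked for both code points (U+200C is ZWNJ, U+200D is ZWJ). This is only a sketch using the standard str.isprintable() and the unicodedata module; the helper name describe() is made up for illustration.

```python
import unicodedata


def describe(ch: str) -> tuple[str, str, bool]:
    """Return (name, general category, isprintable) for a single code point."""
    return (
        unicodedata.name(ch, "<unnamed>"),
        unicodedata.category(ch),
        ch.isprintable(),
    )


# ZWNJ, ZWJ, and two ordinary printable characters for comparison.
for ch in ("\u200c", "\u200d", "a", "€"):
    print(f"U+{ord(ch):04X}", describe(ch))
```

Both joiners are in general category Cf (format), which Python treats as non-printable, even though they affect how neighbouring characters are rendered.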


"\u200C".isprintable() == False

I think in native code iswprint() should be used to guess whether to escape or not. When the first code point is non-printable, the remainder can be escaped; it does not seem to be text for humans. Of course, isprint() must not be used on raw undecoded UTF-8 bytes if the results are to be sensible.
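
A rough sketch of that heuristic in Python, with str.isprintable() standing in for iswprint(). The function name present_txt() and the choice of zone-file style \DDD decimal escapes are assumptions made for this example, not something any draft specifies.

```python
def escape_all(raw: bytes) -> str:
    """Escape every byte outside printable ASCII (plus '"' and '\\') as \\DDD."""
    return "".join(
        chr(b) if 0x20 <= b <= 0x7E and b not in (0x22, 0x5C) else f"\\{b:03d}"
        for b in raw
    )


def present_txt(data: bytes) -> str:
    """Render one TXT character-string for a human reader."""
    try:
        text = data.decode("utf-8")  # strict: invalid UTF-8 raises
    except UnicodeDecodeError:
        return escape_all(data)  # not valid UTF-8, treat as binary
    if text and not text[0].isprintable():
        return escape_all(data)  # first code point non-printable: treat as binary
    out = []
    for ch in text:
        if ch.isprintable() and ch not in '"\\':
            out.append(ch)  # show printable code points as text
        else:
            # escape the UTF-8 bytes of this one code point only
            out.extend(f"\\{b:03d}" for b in ch.encode("utf-8"))
    return "".join(out)


print(present_txt("v=FORSALE1;fval=€999".encode("utf-8")))
print(present_txt(b"\xff\xfe\x00binary"))
```

Decoding happens before any printability test, so the check always runs on code points and never on raw UTF-8 bytes.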

PRECIS or RFC 9839 is an implementation detail. As long as they both result in "háčkyčárky".isprintable() == True, either of them is fine. I think RFC 9839 is sufficient for printing the text and is the simpler form, which I would recommend for use in DNS software. Such software does not need to know what an uppercase letter or a digit is in Arabic. It needs to escape only the Problematic Code Points specified there and print the rest of the UTF-8 encoded text as normal text. If a domain name can contain a socks icon, why not a TXT record? Why are no emoticons rendered?
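
To make the RFC 9839 option concrete, here is a sketch based on my reading of its problematic code points: surrogates, legacy controls other than tab, LF and CR, and noncharacters. The ranges should be checked against the RFC itself. The function names and the \u{XXXX} escape syntax are invented for this example.

```python
def is_problematic(cp: int) -> bool:
    """True if cp is a surrogate, legacy control, or noncharacter."""
    if 0xD800 <= cp <= 0xDFFF:  # surrogates
        return True
    if cp <= 0x1F and cp not in (0x09, 0x0A, 0x0D):  # C0 controls, minus tab/LF/CR
        return True
    if 0x7F <= cp <= 0x9F:  # DEL and C1 controls
        return True
    if 0xFDD0 <= cp <= 0xFDEF:  # noncharacter block
        return True
    if (cp & 0xFFFF) in (0xFFFE, 0xFFFF):  # last two code points of each plane
        return True
    return False


def escape_problematic(text: str) -> str:
    """Escape only problematic code points; pass everything else through."""
    return "".join(
        f"\\u{{{ord(c):04X}}}" if is_problematic(ord(c)) else c for c in text
    )


print(escape_problematic("háčkyčárky 🧦"))
print(escape_problematic("bell\x07here"))
```

Under this rule ZWJ and ZWNJ pass through unescaped, unlike with the isprintable() test, so the two approaches do not agree on every code point. A display tool may also want to escape tab, LF and CR even though they are not problematic in this sense.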

record, or only in the ftxt subtype. For instance, is the host part of an furi entry required to be an ASCII string (i.e. if it's an IDN, must it be the A-label form?) or may it include UTF-8 strings beyond the ASCII-equivalent range? It seems to me it would be valuable to specify which is meant.
Domain labels are out of scope, IDN is unrelated.
I find that a little hard to swallow given that the content of a furi entry is a URI.

A URI can have a normalized form with only A-label input and percent-encoded path characters. It can be presented to the user in U-label form, without percent escaping in paths, in a nice way.

https://háčkyčárky.cz/ can be sent on the wire as https://xn--hkyrky-ptac70bc.cz/ without losing any information. The remote party can display it in a form nice to humans. Most DNS tools do not present TXT records to users in a similar form.
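
The round trip between the two forms can be sketched with the Python standard library. Note the assumptions: Python's built-in "idna" codec implements the older IDNA2003 rules, not IDNA2008, the helper names to_wire() and to_display() are invented, and URLs with a port, userinfo, or already percent-encoded input are not handled.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit


def to_wire(url: str) -> str:
    """Display form -> ASCII-only form (A-label host, percent-encoded path)."""
    parts = urlsplit(url)
    host = parts.hostname.encode("idna").decode("ascii")
    return urlunsplit(
        (parts.scheme, host, quote(parts.path), parts.query, parts.fragment)
    )


def to_display(url: str) -> str:
    """ASCII-only form -> display form (U-label host, unescaped path)."""
    parts = urlsplit(url)
    host = parts.hostname.encode("ascii").decode("idna")
    return urlunsplit(
        (parts.scheme, host, unquote(parts.path), parts.query, parts.fragment)
    )


shown = "https://háčkyčárky.cz/čárky"
wire = to_wire(shown)
print(wire)
print(to_display(wire) == shown)
```

A TXT presentation tool could apply the same idea: keep the ASCII form on the wire and convert only when showing the value to a person.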

They have to be compared case-insensitively, which requires decoding each code point and locating the proper lowercase/uppercase letter matching the source.
Great. What is the uppercase match of â? Is it different to the uppercase match of ä? All the time in every language? (a hint: IDNA2008 solves this problem for you, so there's never a case where the answer to this is ambiguous for an a-label/u-label pair.)

Can you specify where exactly I need that information? I only want TXT records containing non-English text to be printed, with the exception of Multicast DNS labels, where IDNA2008 is not used but UTF-8 is used directly. The only question is whether I need to do escaping or not when presenting a response.

The content of records remains application specific. If I look at the google.com TXT response, I do not see any escaped data, even though it also contains binary content in some base64 encoding.
Sure, but we weren't talking about any TXT record. I was talking about draft-davids-forsalereg, not google.com TXT records, so I don't understand how they are relevant.
Best regards,

A

Best Regards,
Petr

--
Petr Menšík
Senior Software Engineer, RHEL
Red Hat, https://www.redhat.com/
PGP: DFCF908DB7C87E8E529925BC4931CA5B6C9FC5CB

