[rfc-i] Re: Mutable properties of RFCs

Colin Perkins Tue, 15 Apr 2025 01:29:12 -0700

(belatedly, inline)

On 20 Mar 2025, at 6:27, Carsten Bormann wrote:

On 20. Mar 2025, at 07:11, Robert Sparks <rjspa...@nostrum.com> wrote:

On 3/20/25 11:09 AM, Carsten Bormann wrote:
On 20. Mar 2025, at 04:45, Jean Mahoney<jmaho...@staff.rfc-editor.org> wrote:
[JM] TEXT is used for RFCs created in the RFCXML v3 era. ASCII is for older RFCs. The TEXT label indicates the file can contain non-ASCII characters [2].
There are a dozen or so pre-v3 RFCs that are beyond-ASCII.
(And actually a couple that aren’t even UTF-8!)
Pointers to the non-UTF8 encoded RFCs please?

I didn’t take notes when I last checked this, but I can do the check again.


Let’s start with:
rfc101 rfc177 rfc178 rfc182 rfc227 rfc234 rfc235 rfc237 rfc243 rfc270
rfc282 rfc288 rfc290 rfc292 rfc303 rfc306 rfc307 rfc310 rfc313 rfc315
rfc316 rfc317 rfc323 rfc327 rfc367 rfc369 rfc441 rfc2497 rfc2557
rfc2708 rfc2875

For info, here are a few RFCs that are not v3 but not ASCII either:
rfc8187 rfc8264 rfc8265 rfc8266

And then there are the RFCs that contain NUL bytes, like RFC 674…
I didn’t do a full categorization of these critters.
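
A rough sketch of how such a categorization could be automated, assuming the raw bytes of each RFC text file are already in hand. The function name `classify` and its four labels are hypothetical and not taken from any existing tool:

```python
def classify(data: bytes) -> str:
    """Return a coarse label for the raw bytes of one RFC text file.

    The labels are illustrative only: "contains-nul", "ascii",
    "utf-8" (valid UTF-8 that goes beyond ASCII) and "not-utf-8".
    """
    # NUL bytes are valid ASCII and UTF-8, so check for them first.
    if b"\x00" in data:
        return "contains-nul"
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "not-utf-8"
```

Files labelled "not-utf-8" would still need a manual look to decide which legacy encoding fits best; the sketch only narrows down the candidates.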

We have the following, although it’s been many years since it was checked for accuracy:


```
    def charset(self) -> str:
        """
        Most RFCs are UTF-8, or its ASCII subset. A few are not. Return
        an appropriate encoding for the text of this RFC.
        """

        if   (self.doc_id == "RFC0064") or (self.doc_id == "RFC0101") or \
             (self.doc_id == "RFC0177") or (self.doc_id == "RFC0178") or \
             (self.doc_id == "RFC0182") or (self.doc_id == "RFC0227") or \
             (self.doc_id == "RFC0234") or (self.doc_id == "RFC0235") or \
             (self.doc_id == "RFC0237") or (self.doc_id == "RFC0243") or \
             (self.doc_id == "RFC0270") or (self.doc_id == "RFC0282") or \
             (self.doc_id == "RFC0288") or (self.doc_id == "RFC0290") or \
             (self.doc_id == "RFC0292") or (self.doc_id == "RFC0303") or \
             (self.doc_id == "RFC0306") or (self.doc_id == "RFC0307") or \
             (self.doc_id == "RFC0310") or (self.doc_id == "RFC0313") or \
             (self.doc_id == "RFC0315") or (self.doc_id == "RFC0316") or \
             (self.doc_id == "RFC0317") or (self.doc_id == "RFC0323") or \
             (self.doc_id == "RFC0327") or (self.doc_id == "RFC0367") or \
             (self.doc_id == "RFC0369") or (self.doc_id == "RFC0441") or \
             (self.doc_id == "RFC1305"):
            return "iso8859_1"
        elif self.doc_id == "RFC2166":
            return "windows-1252"
        elif (self.doc_id == "RFC2497") or (self.doc_id == "RFC2557"):
            return "iso8859_1"
        elif self.doc_id == "RFC2708":
            # This RFC is corrupt: line 521 has a byte with value 0xC6 that
            # is clearly intended to be a ' character, but that codepoint
            # doesn't correspond to ' in any character set I can find. Use
            # ISO 8859-1 which gets all characters right apart from this.
            # According to Greg Skinner: "regarding the test in line 268
            # for RFC2708, as far as I can tell, U+0092 was introduced in
            # draft-ietf-printmib-job-protomap-01 in multiple places. In -02,
            # it was replaced with U+0027 everywhere except section 5.0.
            # Somehow, that stray character became the corrupt text you
            # identified."
            # (https://github.com/glasgow-ipl/ietfdata/issues/137)
            return "iso8859_1"
        elif self.doc_id == "RFC2875":
            # Both the text and PDF versions of this document have corrupt
            # characters (lines 754 and 926 of the text version). Using
            # ISO 8859-1 is no more corrupt than the original.
            return "iso8859_1"
        else:
            return "utf-8"

```
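
To illustrate why the encoding name returned matters, here is a minimal sketch of decoding under the two non-UTF-8 names used above. The helper `decode_rfc_text` and the sample bytes are made up for illustration; they are not part of ietfdata and not taken from any RFC:

```python
def decode_rfc_text(raw: bytes, encoding: str) -> str:
    """Decode raw RFC bytes with an encoding name such as charset() returns.

    Hypothetical helper; a caller would pass the bytes of the .txt file.
    """
    return raw.decode(encoding)


# Invented sample: a single 0x92 byte where an apostrophe was intended.
sample = b"printer\x92s"

# ISO 8859-1 maps every byte to the code point of the same value, so it
# never fails, but 0x92 becomes the C1 control character U+0092.
as_latin1 = decode_rfc_text(sample, "iso8859_1")

# Windows-1252 maps 0x92 to U+2019 (right single quotation mark).
as_cp1252 = decode_rfc_text(sample, "windows-1252")

# The same bytes are not valid UTF-8, so a strict decode raises.
try:
    decode_rfc_text(sample, "utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

Since ISO 8859-1 decoding cannot raise, it works as a lossless fallback for files with stray bytes, at the cost of possibly showing the wrong character.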

Cheers,
Colin

_______________________________________________
rfc-interest mailing list -- rfc-interest@rfc-editor.org
To unsubscribe send an email to rfc-interest-le...@rfc-editor.org
