(belatedly, inline)

On 20 Mar 2025, at 6:27, Carsten Bormann wrote:

On 20. Mar 2025, at 07:11, Robert Sparks <rjspa...@nostrum.com> wrote:


On 3/20/25 11:09 AM, Carsten Bormann wrote:
On 20. Mar 2025, at 04:45, Jean Mahoney <jmaho...@staff.rfc-editor.org> wrote:
[JM] TEXT is used for RFCs created in the RFCXML v3 era. ASCII is for older RFCs. The TEXT label indicates the file can contain non-ASCII characters [2].
There are a dozen or so pre-v3 RFCs that are beyond-ASCII.
(And actually a couple that aren’t even UTF-8!)
Pointers to the non-UTF8 encoded RFCs please?

I didn’t take notes when I last checked this, but I can do the check again.

Let’s start with:
rfc101 rfc177 rfc178 rfc182 rfc227 rfc234 rfc235 rfc237 rfc243 rfc270
rfc282 rfc288 rfc290 rfc292 rfc303 rfc306 rfc307 rfc310 rfc313 rfc315
rfc316 rfc317 rfc323 rfc327 rfc367 rfc369 rfc441 rfc2497 rfc2557
rfc2708 rfc2875

For info, here are a few RFCs that are not v3 but not ASCII either:
rfc8187 rfc8264 rfc8265 rfc8266

And then there are the RFCs that contain NUL bytes, like RFC 674…
I didn’t do a full categorization of these critters.

We have the following, although it’s been many years since it was checked for accuracy:

```
    def charset(self) -> str:
        """
        Most RFCs are UTF-8, or its ASCII subset. A few are not. Return
        an appropriate encoding for the text of this RFC.
        """
        if (self.doc_id == "RFC0064") or (self.doc_id == "RFC0101") or \
           (self.doc_id == "RFC0177") or (self.doc_id == "RFC0178") or \
           (self.doc_id == "RFC0182") or (self.doc_id == "RFC0227") or \
           (self.doc_id == "RFC0234") or (self.doc_id == "RFC0235") or \
           (self.doc_id == "RFC0237") or (self.doc_id == "RFC0243") or \
           (self.doc_id == "RFC0270") or (self.doc_id == "RFC0282") or \
           (self.doc_id == "RFC0288") or (self.doc_id == "RFC0290") or \
           (self.doc_id == "RFC0292") or (self.doc_id == "RFC0303") or \
           (self.doc_id == "RFC0306") or (self.doc_id == "RFC0307") or \
           (self.doc_id == "RFC0310") or (self.doc_id == "RFC0313") or \
           (self.doc_id == "RFC0315") or (self.doc_id == "RFC0316") or \
           (self.doc_id == "RFC0317") or (self.doc_id == "RFC0323") or \
           (self.doc_id == "RFC0327") or (self.doc_id == "RFC0367") or \
           (self.doc_id == "RFC0369") or (self.doc_id == "RFC0441") or \
           (self.doc_id == "RFC1305"):
            return "iso8859_1"
        elif self.doc_id == "RFC2166":
            return "windows-1252"
        elif (self.doc_id == "RFC2497") or (self.doc_id == "RFC2557"):
            return "iso8859_1"
        elif self.doc_id == "RFC2708":
            # This RFC is corrupt: line 521 has a byte with value 0xC6 that
            # is clearly intended to be a ' character, but that code point
            # doesn't correspond to ' in any character set I can find. Use
            # ISO 8859-1 which gets all characters right apart from this.
            #
            # According to Greg Skinner: "regarding the test in line 268
            # for RFC2708, as far as I can tell, U+0092 was introduced in
            # draft-ietf-printmib-job-protomap-01 in multiple places. In -02,
            # it was replaced with U+0027 everywhere except section 5.0.
            # Somehow, that stray character became the corrupt text you
            # identified."
            # (https://github.com/glasgow-ipl/ietfdata/issues/137)
            return "iso8859_1"
        elif self.doc_id == "RFC2875":
            # Both the text and PDF versions of this document have corrupt
            # characters (lines 754 and 926 of the text version). Using
            # ISO 8859-1 is no more corrupt than the original.
            return "iso8859_1"
        else:
            return "utf-8"

```
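
Since that table hasn't been rechecked in years, a scan over local copies of the text files could be used to re-verify it. The following is only a rough sketch, not part of ietfdata: the names `classify_bytes` and `has_nul` are made up for illustration, and it only tells you whether a file is pure ASCII, valid UTF-8, or neither. It cannot tell you which legacy encoding was intended.

```python
def classify_bytes(data: bytes) -> str:
    """Classify raw file bytes as 'ascii', 'utf-8', or 'other'.

    'other' means the bytes are not valid UTF-8, so some legacy
    encoding (or corruption) is involved; working out which one
    still needs a human to look at the file.
    """
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "other"


def has_nul(data: bytes) -> bool:
    """Report NUL bytes separately: they are valid ASCII, so the
    classification above does not flag them."""
    return b"\x00" in data
```

Running both functions over the bytes of each RFC text file and listing everything that comes back as "other" (or with NUL bytes) should give a candidate list to compare against the hard-coded document IDs above.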

Cheers,
Colin
_______________________________________________
rfc-interest mailing list -- rfc-interest@rfc-editor.org
To unsubscribe send an email to rfc-interest-le...@rfc-editor.org
