Hi,

Thank you, that's really useful.

I am going to use MIME types in my database instead of the TEXT/ASCII labels:
text/plain with the charset from the list below, or application/octet-stream
for the corrupted RFCs.
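
A rough sketch of that mapping, under some assumptions: the function name,
the set of RFCs treated as corrupted (taken from the comments in the quoted
code), and the table translating Python codec names to MIME charset names
are all illustrative, not an existing API.

```python
# Hypothetical: which RFCs count as "corrupted" is an assumption based on
# the comments in the quoted charset() code.
CORRUPT_RFCS = frozenset({"RFC2708", "RFC2875"})

# Python codec names (as returned by the quoted charset() method) mapped to
# the charset names normally used in a MIME Content-Type.
MIME_CHARSETS = {
    "utf-8": "UTF-8",
    "iso8859_1": "ISO-8859-1",
    "windows-1252": "windows-1252",
}


def content_type(doc_id: str, charset: str) -> str:
    """Return a MIME type for the plain-text rendition of an RFC."""
    if doc_id in CORRUPT_RFCS:
        # No charset decodes these files cleanly, so treat them as opaque.
        return "application/octet-stream"
    # Fall back to the codec name itself if it is not in the table.
    return f"text/plain; charset={MIME_CHARSETS.get(charset, charset)}"
```

For example, content_type("RFC0101", "iso8859_1") gives
"text/plain; charset=ISO-8859-1".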

On 4/14/25 12:40 AM, Colin Perkins wrote:
> (belatedly, inline)
> 
> On 20 Mar 2025, at 6:27, Carsten Bormann wrote:
> 
>> On 20. Mar 2025, at 07:11, Robert Sparks <rjspa...@nostrum.com> wrote:
>>>
>>>
>>> On 3/20/25 11:09 AM, Carsten Bormann wrote:
>>>> On 20. Mar 2025, at 04:45, Jean Mahoney <jmaho...@staff.rfc-editor.org> 
>>>> wrote:
>>>>> [JM] TEXT is used for RFCs created in the RFCXML v3 era. ASCII is for 
>>>>> older RFCs. The TEXT label indicates the file can contain non-ASCII 
>>>>> characters [2].
>>>> There are a dozen or so pre-v3 RFCs that are beyond-ASCII.
>>>> (And actually a couple that aren’t even UTF-8!)
>>> Pointers to the non-UTF8 encoded RFCs please?
>>
>> I didn’t take notes when I last checked this, but I can do the check again.
>>
>> Let’s start with:
>> rfc101 rfc177 rfc178 rfc182 rfc227 rfc234 rfc235 rfc237 rfc243 rfc270
>> rfc282 rfc288 rfc290 rfc292 rfc303 rfc306 rfc307 rfc310 rfc313 rfc315
>> rfc316 rfc317 rfc323 rfc327 rfc367 rfc369 rfc441 rfc2497 rfc2557
>> rfc2708 rfc2875
>>
>> For info, here are a few RFCs that are not v3 but not ASCII either:
>> rfc8187 rfc8264 rfc8265 rfc8266
>>
>> And then there are the RFCs that contain NUL bytes, like RFC 674…
>> I didn’t do a full categorization of these critters.
> 
> We have the following, although it’s been many years since it was checked for 
> accuracy:
> 
> ```
>     def charset(self) -> str:
>         """
>         Most RFCs are UTF-8, or its ASCII subset. A few are not. Return
>         an appropriate encoding for the text of this RFC.
>         """
>         if self.doc_id in {
>             "RFC0064", "RFC0101", "RFC0177", "RFC0178", "RFC0182", "RFC0227",
>             "RFC0234", "RFC0235", "RFC0237", "RFC0243", "RFC0270", "RFC0282",
>             "RFC0288", "RFC0290", "RFC0292", "RFC0303", "RFC0306", "RFC0307",
>             "RFC0310", "RFC0313", "RFC0315", "RFC0316", "RFC0317", "RFC0323",
>             "RFC0327", "RFC0367", "RFC0369", "RFC0441", "RFC1305",
>         }:
>             return "iso8859_1"
>         elif self.doc_id == "RFC2166":
>             return "windows-1252"
>         elif (self.doc_id == "RFC2497") or (self.doc_id == "RFC2557"):
>             return "iso8859_1"
>         elif self.doc_id == "RFC2708":
>             # This RFC is corrupt: line 521 has a byte with value 0xC6 that
>             # is clearly intended to be a ' character, but that code point
>             # doesn't correspond to ' in any character set I can find. Use
>             # ISO 8859-1 which gets all characters right apart from this.
>             #
>             # According to Greg Skinner: "regarding the test in line 268
>             # for RFC2708, as far as I can tell, U+0092 was introduced in
>             # draft-ietf-printmib-job-protomap-01 in multiple places. In -02,
>             # it was replaced with U+0027 everywhere except section 5.0.
>             # Somehow, that stray character became the corrupt text you
>             # identified."
>             # (https://github.com/glasgow-ipl/ietfdata/issues/137)
>             return "iso8859_1"
>         elif self.doc_id == "RFC2875":
>             # Both the text and PDF versions of this document have corrupt
>             # characters (lines 754 and 926 of the text version). Using
>             # ISO 8859-1 is no more corrupt than the original.
>             return "iso8859_1"
>         else:
>             return "utf-8"
> 
> ```
> 
> Cheers,
> Colin
> 
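
Since the list above has not been checked for accuracy in years, a minimal
sketch for re-checking a file's raw bytes may help. The function name and
the category labels are assumptions for illustration; the categories follow
the ones mentioned in the thread (NUL bytes, ASCII, UTF-8, neither).

```python
def classify(raw: bytes) -> str:
    """Roughly classify the raw bytes of an RFC text file."""
    if b"\x00" in raw:
        # NUL bytes are flagged first, whatever the rest looks like.
        return "contains-nul"
    if raw.isascii():
        return "ascii"
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # Needs a legacy charset such as ISO 8859-1, or is corrupt.
        return "not-utf-8"
    return "utf-8"
```

Files that come back as "not-utf-8" are the ones that need an explicit
entry in a table like the one quoted above. The check cannot tell which
legacy charset is the right one, so that part still needs a human look.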

_______________________________________________
rfc-interest mailing list -- rfc-interest@rfc-editor.org
To unsubscribe send an email to rfc-interest-le...@rfc-editor.org
