Hi,

Thank you, that's really useful.

I am going to use MIME types in my database instead of the TEXT/ASCII labels.
That means text/plain with the charset listed below, or
application/octet-stream for the corrupted RFCs.

On 4/14/25 12:40 AM, Colin Perkins wrote:
> (belatedly, inline)
>
> On 20 Mar 2025, at 6:27, Carsten Bormann wrote:
>
>> On 20. Mar 2025, at 07:11, Robert Sparks <rjspa...@nostrum.com> wrote:
>>>
>>>
>>> On 3/20/25 11:09 AM, Carsten Bormann wrote:
>>>> On 20. Mar 2025, at 04:45, Jean Mahoney <jmaho...@staff.rfc-editor.org>
>>>> wrote:
>>>>> [JM] TEXT is used for RFCs created in the RFCXML v3 era. ASCII is for
>>>>> older RFCs. The TEXT label indicates the file can contain non-ASCII
>>>>> characters [2].
>>>> There are a dozen or so pre-v3 RFCs that are beyond-ASCII.
>>>> (And actually a couple that aren’t even UTF-8!)
>>> Pointers to the non-UTF8 encoded RFCs please?
>>
>> I didn’t take notes when I last checked this, but I can do the check again.
>>
>> Let’s start with:
>> rfc101 rfc177 rfc178 rfc182 rfc227 rfc234 rfc235 rfc237 rfc243 rfc270
>> rfc282 rfc288 rfc290 rfc292 rfc303 rfc306 rfc307 rfc310 rfc313 rfc315
>> rfc316 rfc317 rfc323 rfc327 rfc367 rfc369 rfc441 rfc2497 rfc2557
>> rfc2708 rfc2875
>>
>> For info, here are a few RFCs that are not v3 but not ASCII either:
>> rfc8187 rfc8264 rfc8265 rfc8266
>>
>> And then there are the RFCs that contain NUL bytes, like RFC 674…
>> I didn’t do a full categorization of these critters.
>
> We have the following, although it’s been many years since it was checked for
> accuracy:
>
> ```
> def charset(self) -> str:
>     """
>     Most RFCs are UTF-8, or it's ASCII subset. A few are not. Return
>     an appropriate encoding for the text of this RFC.
>     """
>     if (self.doc_id == "RFC0064") or (self.doc_id == "RFC0101") or \
>        (self.doc_id == "RFC0177") or (self.doc_id == "RFC0178") or \
>        (self.doc_id == "RFC0182") or (self.doc_id == "RFC0227") or \
>        (self.doc_id == "RFC0234") or (self.doc_id == "RFC0235") or \
>        (self.doc_id == "RFC0237") or (self.doc_id == "RFC0243") or \
>        (self.doc_id == "RFC0270") or (self.doc_id == "RFC0282") or \
>        (self.doc_id == "RFC0288") or (self.doc_id == "RFC0290") or \
>        (self.doc_id == "RFC0292") or (self.doc_id == "RFC0303") or \
>        (self.doc_id == "RFC0306") or (self.doc_id == "RFC0307") or \
>        (self.doc_id == "RFC0310") or (self.doc_id == "RFC0313") or \
>        (self.doc_id == "RFC0315") or (self.doc_id == "RFC0316") or \
>        (self.doc_id == "RFC0317") or (self.doc_id == "RFC0323") or \
>        (self.doc_id == "RFC0327") or (self.doc_id == "RFC0367") or \
>        (self.doc_id == "RFC0369") or (self.doc_id == "RFC0441") or \
>        (self.doc_id == "RFC1305"):
>         return "iso8859_1"
>     elif self.doc_id == "RFC2166":
>         return "windows-1252"
>     elif (self.doc_id == "RFC2497") or (self.doc_id == "RFC2557"):
>         return "iso8859_1"
>     elif self.doc_id == "RFC2708":
>         # This RFC is corrupt: line 521 has a byte with value 0xC6 that
>         # is clearly intended to be a ' character, but that code point
>         # doesn't correspond to ' in any character set I can find. Use
>         # ISO 8859-1 which gets all characters right apart from this.
>         #
>         # According to Greg Skinner: "regarding the test in line 268
>         # for RFC2708, as far as I can tell, U+0092 was introduced in
>         # draft-ietf-printmib-job-protomap-01 in multiple places. In -02,
>         # it was replaced with U+0027 everywhere except section 5.0.
>         # Somehow, that stray character became the corrupt text you
>         # identified."
>         # (https://github.com/glasgow-ipl/ietfdata/issues/137)
>         return "iso8859_1"
>     elif self.doc_id == "RFC2875":
>         # Both the text and PDF versions of this document have corrupt
>         # characters (lines 754 and 926 of the text version). Using
>         # ISO 8859-1 is no more corrupt than the original.
>         return "iso8859_1"
>     else:
>         return "utf-8"
>
> ```
>
> Cheers,
> Colin
>
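
For the database labels mentioned at the top of this message, a minimal sketch of the mapping might look like the following. It is only an illustration: the names `CORRUPT_RFCS`, `IANA_NAMES` and `mime_type` are hypothetical, and treating RFC2708 and RFC2875 as "the corrupted RFCs" is an assumption based on the comments in the quoted code. The charset argument is expected to be whatever the quoted `charset()` method returns.

```python
# Hypothetical helper, not part of ietfdata: turn a doc_id plus the Python
# codec name returned by the quoted charset() method into a MIME type label.

# Assumption: only the two RFCs that the quoted comments call corrupt are
# stored as opaque bytes. Adjust the set if more turn up.
CORRUPT_RFCS = frozenset({"RFC2708", "RFC2875"})

# Python codec names used in the quoted code, mapped to the charset names
# normally written in a Content-Type header.
IANA_NAMES = {
    "iso8859_1": "ISO-8859-1",
    "windows-1252": "windows-1252",
    "utf-8": "UTF-8",
}


def mime_type(doc_id: str, charset: str) -> str:
    """Return the MIME type label to store for the text version of an RFC."""
    if doc_id in CORRUPT_RFCS:
        return "application/octet-stream"
    # Unknown codec names are passed through unchanged rather than guessed.
    return f"text/plain; charset={IANA_NAMES.get(charset, charset)}"


if __name__ == "__main__":
    print(mime_type("RFC0101", "iso8859_1"))
    print(mime_type("RFC2166", "windows-1252"))
    print(mime_type("RFC2708", "iso8859_1"))
```

Keeping the corrupt set separate from the charset lookup means the quoted list can still be used as-is for decoding, while the database label records that those two files are not trustworthy as text.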
_______________________________________________
rfc-interest mailing list -- rfc-interest@rfc-editor.org
To unsubscribe send an email to rfc-interest-le...@rfc-editor.org