I'm using Nutch to fetch a fairly large number of URLs, and I'm finding that I'm getting all kinds of "character set" values from servers. Below is a small sample. Does anyone on this list know of a good/complete list of valid values?
See <http://www.iana.org/assignments/character-sets>.
What I'm really looking for is the maximum length of XXX in the "charset=XXX" portion of the Content-Type HTTP response header. Does anyone know what the maximum length is? I thought it was 12, but even from this small sample below, I see there are some of length 14.
I believe I found somewhere that the maximum length of a valid IANA charset name is 40 bytes. Or at least that's what I defined it to be for Palm OS.
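As a rough illustration of applying such a limit, here is a minimal sketch that pulls the charset label out of a Content-Type value and rejects anything longer than 40 characters. The names extract_charset and MAX_CHARSET_LEN are made up for this example, the regular expression is a simplification rather than a full header parser, and the 40 is just the figure mentioned above.

```python
import re

# Assumed limit: the 40-byte figure mentioned above, not verified here.
MAX_CHARSET_LEN = 40

# Simplified pattern: charset=XXX, with optional double quotes around XXX.
_CHARSET_RE = re.compile(r'charset\s*=\s*"?([^";\s]+)"?', re.IGNORECASE)


def extract_charset(content_type):
    """Return the charset label from a Content-Type value, or None."""
    match = _CHARSET_RE.search(content_type)
    if match is None:
        return None  # no charset parameter present
    label = match.group(1)
    if len(label) > MAX_CHARSET_LEN:
        return None  # too long to be a plausible charset name
    return label
```

For example, extract_charset("text/html; charset=KS_C_5601-1987") gives back the 14-character label from the sample, while a header with no charset parameter gives None.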
In the list below, some of these seem bogus to me. For example, ".UTF8" isn't valid. Many are aliases, for example UTF8 and "UTF-8". Hmm, lots of these aren't listed on the IANA web site, for example X-SJIS (should be SHIFT_JIS), UTF8 (should be UTF-8), ISO8859-1 (should be ISO-8859-1), etc.
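One way to handle the aliases is a small lookup table applied after upper-casing the label. The sketch below is only that: the table holds just the three mappings named above, and normalize_charset is a hypothetical helper, not anything from Nutch. Labels that aren't in the table (including bogus ones like ".UTF8") are passed through unchanged, so deciding whether they are valid is left to the caller.

```python
# Hypothetical alias table, limited to the mappings named in this message.
ALIASES = {
    "UTF8": "UTF-8",
    "X-SJIS": "SHIFT_JIS",
    "ISO8859-1": "ISO-8859-1",
}


def normalize_charset(label):
    """Upper-case a charset label and map known aliases to the preferred name."""
    key = label.strip().upper()
    # Unknown labels are returned as-is (upper-cased), not validated.
    return ALIASES.get(key, key)
```

With this, normalize_charset("utf8") and normalize_charset("UTF-8") both come out as "UTF-8", and "x-sjis" comes out as "SHIFT_JIS".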
If you send me a full list, I can tell you which ones are valid IANA, and which ones should be set up as aliases.
-- Ken
---- ENC: 646
---- ENC: 8859_1
---- ENC: 8859-15
---- ENC: BIG5
---- ENC: CP1251
---- ENC: EN
---- ENC: EUC-JP
---- ENC: EUC-KR
---- ENC: GB2312
---- ENC: GBK
---- ENC: ISO-2022-UTF-8
---- ENC: ISO-8859-1
---- ENC: ISO8859_1
---- ENC: ISO8859-1
---- ENC: ISO-8859-15
---- ENC: ISO-8859-2
---- ENC: ISO-8859-7
---- ENC: ISO-8859-8-I
---- ENC: KOI8-R
---- ENC: KS_C_5601
---- ENC: KS_C_5601-1987
---- ENC: MACINTOSH
---- ENC: SHIFT_JIS
---- ENC: TIS-620
---- ENC: US-ASCII
---- ENC: UTF8
---- ENC: .UTF8
---- ENC: UTF-8
---- ENC: WINDOWS-1250
---- ENC: WINDOWS-1251
---- ENC: WINDOWS-1252
---- ENC: WINDOWS-1254
---- ENC: WINDOWS-1255
---- ENC: WINDOWS-1256
---- ENC: X-SJIS
---- ENC: X-USER-DEFINED
-- Ken Krugler TransPac Software, Inc. <http://www.transpac.com> +1 530-470-9200
