Hello,

I'm using Nutch to fetch a fairly large number of URLs, and I'm getting all kinds of "character set" values from servers. Below is a small sample. Does anyone on this list know of a good/complete list of valid values?

What I'm really looking for is the maximum length of XXX in the "charset=XXX" portion of the Content-Type HTTP response header. Does anyone know what that maximum is? I thought it was 12, but even in the small sample below there are some values of length 14.
---- ENC: 646
---- ENC: 8859_1
---- ENC: 8859-15
---- ENC: BIG5
---- ENC: CP1251
---- ENC: EN
---- ENC: EUC-JP
---- ENC: EUC-KR
---- ENC: GB2312
---- ENC: GBK
---- ENC: ISO-2022-UTF-8
---- ENC: ISO-8859-1
---- ENC: ISO8859_1
---- ENC: ISO8859-1
---- ENC: ISO-8859-15
---- ENC: ISO-8859-2
---- ENC: ISO-8859-7
---- ENC: ISO-8859-8-I
---- ENC: KOI8-R
---- ENC: KS_C_5601
---- ENC: KS_C_5601-1987
---- ENC: MACINTOSH
---- ENC: SHIFT_JIS
---- ENC: TIS-620
---- ENC: US-ASCII
---- ENC: UTF8
---- ENC: .UTF8
---- ENC: UTF-8
---- ENC: WINDOWS-1250
---- ENC: WINDOWS-1251
---- ENC: WINDOWS-1252
---- ENC: WINDOWS-1254
---- ENC: WINDOWS-1255
---- ENC: WINDOWS-1256
---- ENC: X-SJIS
---- ENC: X-USER-DEFINED

Thanks,
Otis
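
P.S. In case it helps anyone reproduce the lengths, here is a minimal Java sketch of how the charset token could be pulled out of a Content-Type value and measured. The class and method names (CharsetLength, extractCharset, maxLength) are made up for illustration and are not part of Nutch. The parsing is deliberately simplified and probably does not cover every quoting case servers send.

```java
import java.util.Locale;

public class CharsetLength {

    // Hypothetical helper (not a Nutch API): returns the value of the
    // charset parameter in a Content-Type header, or null if absent.
    static String extractCharset(String contentType) {
        if (contentType == null) {
            return null;
        }
        // Parameters are separated by ';', e.g. "text/html; charset=UTF-8".
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    // Length of the longest string in the array (0 for an empty array).
    static int maxLength(String[] values) {
        int max = 0;
        for (String v : values) {
            max = Math.max(max, v.length());
        }
        return max;
    }

    public static void main(String[] args) {
        // A few values taken from the sample above.
        String[] sample = {
            "ISO-2022-UTF-8", "KS_C_5601-1987", "X-USER-DEFINED",
            "WINDOWS-1252", "UTF-8"
        };
        System.out.println(extractCharset("text/html; charset=ISO-8859-1")); // ISO-8859-1
        System.out.println(maxLength(sample)); // 14
    }
}
```

Run over the sample above, the longest values (ISO-2022-UTF-8, KS_C_5601-1987, X-USER-DEFINED) come out at 14 characters, which is where the number in my question comes from.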
