Hello,

I'm using Nutch to fetch a fairly large number of URLs, and I'm finding
that I'm getting all kinds of "character set" values from servers.  Below
is a small sample.  Does anyone on this list know of a good/complete
list of valid values?
What I'm really looking for is the maximum length of XXX in the
"charset=XXX" portion of the Content-Type HTTP response header.  Does
anyone know what the maximum length is?  I thought it was 12, but even
in the small sample below, I see some values of length 14.

---- ENC: 646
---- ENC: 8859_1
---- ENC: 8859-15
---- ENC: BIG5
---- ENC: CP1251
---- ENC: EN
---- ENC: EUC-JP
---- ENC: EUC-KR
---- ENC: GB2312
---- ENC: GBK
---- ENC: ISO-2022-UTF-8
---- ENC: ISO-8859-1
---- ENC: ISO8859_1
---- ENC: ISO8859-1
---- ENC: ISO-8859-15
---- ENC: ISO-8859-2
---- ENC: ISO-8859-7
---- ENC: ISO-8859-8-I
---- ENC: KOI8-R
---- ENC: KS_C_5601
---- ENC: KS_C_5601-1987
---- ENC: MACINTOSH
---- ENC: SHIFT_JIS
---- ENC: TIS-620
---- ENC: US-ASCII
---- ENC: UTF8
---- ENC: .UTF8
---- ENC: UTF-8
---- ENC: WINDOWS-1250
---- ENC: WINDOWS-1251
---- ENC: WINDOWS-1252
---- ENC: WINDOWS-1254
---- ENC: WINDOWS-1255
---- ENC: WINDOWS-1256
---- ENC: X-SJIS
---- ENC: X-USER-DEFINED
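
For reference, here is a rough sketch of how values like the ones above could be pulled out of a Content-Type header and checked against what the JVM recognizes. This is not Nutch code: the class name CharsetLabelCheck and its helper methods are made up for illustration, and it relies only on java.nio.charset.Charset from the standard library. Whether a given label is recognized depends on the JVM in use, so the sketch makes no claim about which of the sample values pass.

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical helper, not part of Nutch.
class CharsetLabelCheck {

    // Returns the raw charset parameter of a Content-Type value, or null if absent.
    static String extractCharset(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Some servers quote the value, e.g. charset="utf-8".
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    // True if this JVM has a charset registered under the label (name or alias).
    // Syntactically illegal names such as ".UTF8" make isSupported throw
    // IllegalCharsetNameException (an IllegalArgumentException), treated here as unknown.
    static boolean isKnownToJvm(String label) {
        if (label == null) {
            return false;
        }
        try {
            return Charset.isSupported(label);
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    // Length of the longest label, useful for sizing a field that stores the value.
    static int maxLength(String... labels) {
        int max = 0;
        for (String label : labels) {
            max = Math.max(max, label.length());
        }
        return max;
    }

    public static void main(String[] args) {
        String[] sample = {"8859_1", ".UTF8", "UTF-8", "KS_C_5601-1987", "X-USER-DEFINED"};
        for (String label : sample) {
            System.out.println(label + " -> known to JVM: " + isKnownToJvm(label));
        }
        System.out.println("max length: " + maxLength(sample));
    }
}
```

Running a check like this over the full set of fetched values would show which ones are usable as-is and which need a fallback, and would give the observed maximum length rather than an assumed one.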

Thanks,
Otis
