I'm using Nutch to fetch a fairly large number of URLs, and I'm finding
that I'm getting all kinds of "character set" values from servers.  Below
is a small sample.  Does anyone on this list know of a good/complete
list of valid values?

See <http://www.iana.org/assignments/character-sets>.

What I'm really looking for is the maximum length of XXX in the
"charset=XXX" portion of the Content-Type HTTP response header.  Does
anyone know what the maximum length is?  I thought it was 12, but even
from this small sample below, I see there are some of length 14.

I believe I found somewhere that the maximum length of a valid IANA charset name is 40 bytes. Or at least that's what I defined it to be for Palm OS.
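
As a rough illustration of applying that limit, here is a minimal Python sketch that pulls the charset value out of a Content-Type header and rejects anything longer than 40 characters. The function name, the regular expression, and the choice to return None for over-long values are my own assumptions for the example, not anything Nutch does.

```python
import re

# Assumed limit, taken from the 40-byte figure mentioned above.
MAX_CHARSET_LEN = 40

# Matches charset=XXX or charset="XXX", case-insensitively.
_CHARSET_RE = re.compile(r'charset\s*=\s*"?([^";\s]+)', re.IGNORECASE)


def extract_charset(content_type):
    """Return the charset value from a Content-Type header, or None.

    None is returned when no charset parameter is present or when the
    value exceeds MAX_CHARSET_LEN characters.
    """
    match = _CHARSET_RE.search(content_type)
    if match is None:
        return None
    value = match.group(1)
    if len(value) > MAX_CHARSET_LEN:
        return None
    return value
```

For example, extract_charset("text/html; charset=ISO-8859-8-I") gives back "ISO-8859-8-I", while a header with no charset parameter gives None.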

In the list below, some of these seem bogus to me. For example, ".UTF8" isn't valid. Many are aliases, for example UTF8 and "UTF-8". Hmm, lots of these aren't listed on the IANA web site, for example X-SJIS (should be SHIFT_JIS), UTF8 (should be UTF-8), ISO8859-1 (should be ISO-8859-1), etc.

If you send me a full list, I can tell you which ones are valid IANA, and which ones should be set up as aliases.
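
In the meantime, the alias handling could be sketched roughly as below. This is only an illustration: the table holds just the three aliases mentioned above, the function name is made up, and a real table would need to be built from the full list and the IANA registry.

```python
# Hypothetical alias table; only the three mappings named above.
ALIASES = {
    "X-SJIS": "SHIFT_JIS",
    "UTF8": "UTF-8",
    "ISO8859-1": "ISO-8859-1",
}


def normalize_charset(name):
    """Map a server-supplied charset name to its preferred form.

    Names are compared case-insensitively. Anything not in the alias
    table is returned upper-cased but otherwise unchanged, so bogus
    values such as ".UTF8" are passed through, not silently fixed.
    """
    key = name.strip().upper()
    return ALIASES.get(key, key)
```

Passing unknown names through unchanged is a deliberate choice here, so that the caller can still check the result against a list of valid names and decide what to do with the bogus ones.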

-- Ken



---- ENC: 646
---- ENC: 8859_1
---- ENC: 8859-15
---- ENC: BIG5
---- ENC: CP1251
---- ENC: EN
---- ENC: EUC-JP
---- ENC: EUC-KR
---- ENC: GB2312
---- ENC: GBK
---- ENC: ISO-2022-UTF-8
---- ENC: ISO-8859-1
---- ENC: ISO8859_1
---- ENC: ISO8859-1
---- ENC: ISO-8859-15
---- ENC: ISO-8859-2
---- ENC: ISO-8859-7
---- ENC: ISO-8859-8-I
---- ENC: KOI8-R
---- ENC: KS_C_5601
---- ENC: KS_C_5601-1987
---- ENC: MACINTOSH
---- ENC: SHIFT_JIS
---- ENC: TIS-620
---- ENC: US-ASCII
---- ENC: UTF8
---- ENC: .UTF8
---- ENC: UTF-8
---- ENC: WINDOWS-1250
---- ENC: WINDOWS-1251
---- ENC: WINDOWS-1252
---- ENC: WINDOWS-1254
---- ENC: WINDOWS-1255
---- ENC: WINDOWS-1256
---- ENC: X-SJIS
---- ENC: X-USER-DEFINED


--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200
