Hello Ken,

40 bytes, thanks.  So far I haven't encountered anything longer than 18
bytes, but people, software, and servers seem to put all kinds of crazy
and invalid character sets in HTTP response headers, so one day I may
have to just cut off everything beyond 40 bytes.
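
A rough sketch of that cut-off, in Python for brevity rather than Nutch's
Java. The names extract_charset, MAX_CHARSET_LEN, and ALIASES are made up
for illustration, and the alias table holds only the three corrections Ken
lists below, so treat it as a starting point, not a complete mapping.

```python
import re

# Limit Ken mentions for a valid IANA charset name (assumed, not verified here).
MAX_CHARSET_LEN = 40

# Hypothetical alias table: only the corrections named in Ken's reply.
ALIASES = {
    "UTF8": "UTF-8",
    "X-SJIS": "SHIFT_JIS",
    "ISO8859-1": "ISO-8859-1",
}

# Matches charset=XXX or charset="XXX" anywhere in a Content-Type value.
_CHARSET_RE = re.compile(r'charset\s*=\s*"?([^\s;"]+)', re.IGNORECASE)


def extract_charset(content_type):
    """Return an upper-cased, length-capped charset name, or None if absent."""
    match = _CHARSET_RE.search(content_type or "")
    if match is None:
        return None
    # Slicing counts characters; for ASCII names that is the same as bytes.
    name = match.group(1)[:MAX_CHARSET_LEN].upper()
    # Map known aliases to their preferred spelling, otherwise keep the name.
    return ALIASES.get(name, name)
```

With this, extract_charset("text/html; charset=utf8") gives "UTF-8", and a
bogus value such as ".UTF8" is passed through unchanged (just upper-cased),
so it still has to be rejected or mapped somewhere else.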

Otis


--- Ken Krugler <[EMAIL PROTECTED]> wrote:

> >I'm using Nutch to fetch a fairly large number of URLs, and I'm finding
> >that I'm getting all kinds of "character set" values from servers.
> >Below is a small sample.  Does anyone on this list know of a
> >good/complete list of valid values?
> 
> See <http://www.iana.org/assignments/character-sets>.
> 
> >What I'm really looking for is the maximum length of XXX in the
> >"charset=XXX" portion of the Content-Type HTTP response header.  Does
> >anyone know what the maximum length is?  I thought it was 12, but even
> >from this small sample below, I see there are some of length 14.
> 
> I believe I found somewhere that the maximum length of a valid IANA 
> charset name is 40 bytes. Or at least that's what I defined it to be 
> for Palm OS.
> 
> In the list below, some of these seem bogus to me. For example, 
> ".UTF8" isn't valid. Many are aliases, for example UTF8 and "UTF-8". 
> Hmm, lots of these aren't listed on the IANA web site, for example 
> X-SJIS (should be SHIFT_JIS), UTF8 (should be UTF-8), ISO8859-1 
> (should be ISO-8859-1), etc.
> 
> If you send me a full list, I can tell you which ones are valid IANA,
> and which ones should be set up as aliases.
> 
> -- Ken
> 
> 
> 
> >---- ENC: 646
> >---- ENC: 8859_1
> >---- ENC: 8859-15
> >---- ENC: BIG5
> >---- ENC: CP1251
> >---- ENC: EN
> >---- ENC: EUC-JP
> >---- ENC: EUC-KR
> >---- ENC: GB2312
> >---- ENC: GBK
> >---- ENC: ISO-2022-UTF-8
> >---- ENC: ISO-8859-1
> >---- ENC: ISO8859_1
> >---- ENC: ISO8859-1
> >---- ENC: ISO-8859-15
> >---- ENC: ISO-8859-2
> >---- ENC: ISO-8859-7
> >---- ENC: ISO-8859-8-I
> >---- ENC: KOI8-R
> >---- ENC: KS_C_5601
> >---- ENC: KS_C_5601-1987
> >---- ENC: MACINTOSH
> >---- ENC: SHIFT_JIS
> >---- ENC: TIS-620
> >---- ENC: US-ASCII
> >---- ENC: UTF8
> >---- ENC: .UTF8
> >---- ENC: UTF-8
> >---- ENC: WINDOWS-1250
> >---- ENC: WINDOWS-1251
> >---- ENC: WINDOWS-1252
> >---- ENC: WINDOWS-1254
> >---- ENC: WINDOWS-1255
> >---- ENC: WINDOWS-1256
> >---- ENC: X-SJIS
> >---- ENC: X-USER-DEFINED
> 
> 
> -- 
> Ken Krugler
> TransPac Software, Inc.
> <http://www.transpac.com>
> +1 530-470-9200
> 
> 
