Hello Ken,

40 bytes it is, thanks. So far I haven't encountered anything longer than 18 bytes, but people, software, and servers seem to put all kinds of crazy and invalid character sets in HTTP response headers, so one day I may have to simply cut off everything beyond 40 bytes.

Otis

--- Ken Krugler <[EMAIL PROTECTED]> wrote:

> >I'm using Nutch to fetch a fairly large number of URLs, and I'm finding
> >that I'm getting all kinds of "character set" values from servers. Below
> >is a small sample. Does anyone on this list know of a good/complete
> >list of valid values?
>
> See <http://www.iana.org/assignments/character-sets>.
>
> >What I'm really looking for is the maximum length of XXX in the
> >"charset=XXX" portion of the Content-Type HTTP response header. Does
> >anyone know what the maximum length is? I thought it was 12, but even
> >in this small sample below, I see there are some of length 14.
>
> I believe I found somewhere that the maximum length of a valid IANA
> charset name is 40 bytes. Or at least that's what I defined it to be
> for Palm OS.
>
> In the list below, some of these seem bogus to me. For example,
> ".UTF8" isn't valid. Many are aliases, for example UTF8 and "UTF-8".
> Hmm, lots of these aren't listed on the IANA web site, for example
> X-SJIS (should be SHIFT_JIS), UTF8 (should be UTF-8), ISO8859-1
> (should be ISO-8859-1), etc.
>
> If you send me a full list, I can tell you which ones are valid IANA
> names and which ones should be set up as aliases.
>
> -- Ken
>
> >---- ENC: 646
> >---- ENC: 8859_1
> >---- ENC: 8859-15
> >---- ENC: BIG5
> >---- ENC: CP1251
> >---- ENC: EN
> >---- ENC: EUC-JP
> >---- ENC: EUC-KR
> >---- ENC: GB2312
> >---- ENC: GBK
> >---- ENC: ISO-2022-UTF-8
> >---- ENC: ISO-8859-1
> >---- ENC: ISO8859_1
> >---- ENC: ISO8859-1
> >---- ENC: ISO-8859-15
> >---- ENC: ISO-8859-2
> >---- ENC: ISO-8859-7
> >---- ENC: ISO-8859-8-I
> >---- ENC: KOI8-R
> >---- ENC: KS_C_5601
> >---- ENC: KS_C_5601-1987
> >---- ENC: MACINTOSH
> >---- ENC: SHIFT_JIS
> >---- ENC: TIS-620
> >---- ENC: US-ASCII
> >---- ENC: UTF8
> >---- ENC: .UTF8
> >---- ENC: UTF-8
> >---- ENC: WINDOWS-1250
> >---- ENC: WINDOWS-1251
> >---- ENC: WINDOWS-1252
> >---- ENC: WINDOWS-1254
> >---- ENC: WINDOWS-1255
> >---- ENC: WINDOWS-1256
> >---- ENC: X-SJIS
> >---- ENC: X-USER-DEFINED
>
> --
> Ken Krugler
> TransPac Software, Inc.
> <http://www.transpac.com>
> +1 530-470-9200
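
For what it's worth, here is a rough sketch of how the ideas in this thread could fit together: pull the XXX out of "charset=XXX", cap it at 40 bytes, and map a few known-bad spellings to their IANA names. It is written in Python only for brevity and is not Nutch code. The function name `extract_charset`, the constant `MAX_CHARSET_BYTES`, and the `ALIASES` table are all made up for illustration, and the table holds only the three corrections Ken mentioned above; a real one would be built from the IANA list.

```python
import re

# Cap discussed in this thread; an assumption, not a value taken from a spec.
MAX_CHARSET_BYTES = 40

# Hypothetical alias table, limited to the corrections named in the thread.
ALIASES = {
    "UTF8": "UTF-8",
    "X-SJIS": "SHIFT_JIS",
    "ISO8859-1": "ISO-8859-1",
}

# Matches charset=XXX, with optional quotes, anywhere in a Content-Type value.
_CHARSET_RE = re.compile(r'charset\s*=\s*"?([^\s;"]+)', re.IGNORECASE)


def extract_charset(content_type):
    """Return a cleaned-up charset label, or None if the header has none."""
    match = _CHARSET_RE.search(content_type)
    if match is None:
        return None
    label = match.group(1).upper()
    # Force the label to ASCII so one character is one byte, then truncate.
    raw = label.encode("ascii", "replace")[:MAX_CHARSET_BYTES]
    label = raw.decode("ascii")
    # Unknown labels (".UTF8", "EN", ...) pass through unchanged.
    return ALIASES.get(label, label)


if __name__ == "__main__":
    for header in (
        "text/html; charset=utf8",
        'text/html; charset="X-SJIS"',
        "text/html; charset=.UTF8",
        "text/html",
    ):
        print(header, "->", extract_charset(header))
```

Note that this only normalizes; it does not decide whether a label such as ".UTF8" is valid. That check would still need the full IANA list.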
