On Fri, 4 May 2001, Daniel Resare wrote:

> On Fri, May 04, 2001 at 11:01:50AM +0100, Cameron wrote:
> > Essentially, i'm working on some iDNS stuff, and i'm looking for a nice easy
> > way to detect whether a string contains a utf8 character. I've looked
> > it doesn't seem to work on the variables i've passed from a cgi script. of .......
> > do a reliable detection. An example of an input source that is probably not
> > long enough would be a search widget on a web page."

> I wrote a small utility that checks a string for UTF-8 validity a while ago
> and I found out that out of approximately 500k lines of varying charset
> that contained characters with 8th bit set (gnome translations) about
> 0.02% of the lines passed as UTF-8 that was not, and almost all of them
> were single words in korean. So, you can not be positively sure a given
> string really is UTF-8, but you can make a good guess.

I'm curious how well your utility does when tested against non-UTF-8
text (mostly a single word or otherwise very short) other than Korean
(most likely in EUC-KR). Korean in EUC-KR (like text in any other EUC
encoding such as EUC-JP, EUC-CN, and EUC-TW; well, EUC-JP and EUC-TW use
a couple more octets to represent code sets 2 and 3, which are not used
in EUC-CN and EUC-KR) uses a pretty 'limited' range of octets (either
[0x20-0x7E] or two-octet sequences of [0xA1-0xFE]), which, I'm afraid,
could lead to a better test result than you can get in a more generic
case.

Could you try your utility against Shift-JIS, Big5, and some Russian
encodings (which are not ISO-2022 compliant)? The Gnome translation
packs should have translations in EUC-JP (if not in Shift-JIS), and you
can convert them to Shift-JIS with iconv to test your utility.

Thank you,

Jungshik Shin

-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/
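
For illustration, the kind of validity check discussed in this thread
could look roughly like the Python sketch below. This is not Daniel's
utility; the function names (has_high_bit, looks_like_utf8,
count_ambiguous_euc_pairs) are made up for this example, and it assumes
that "passes as UTF-8" means "contains at least one octet with the 8th
bit set and decodes under a strict UTF-8 decoder". The last function
counts octet pairs in the [0xA1-0xFE] range that are also well-formed
UTF-8; it looks only at the octet range, not at which code points a
particular EUC character set actually assigns.

```python
def has_high_bit(data: bytes) -> bool:
    """True if any octet has the 8th bit set (i.e. the text is not pure ASCII)."""
    return any(b >= 0x80 for b in data)


def looks_like_utf8(data: bytes) -> bool:
    """Guess whether data is UTF-8: non-ASCII and valid under strict decoding.

    This is only a heuristic. Short strings in other encodings can pass.
    """
    if not has_high_bit(data):
        # Pure ASCII is identical in UTF-8 and in the EUC encodings,
        # so there is nothing to detect.
        return False
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


def count_ambiguous_euc_pairs() -> int:
    """Count two-octet sequences in [0xA1-0xFE] that are also valid UTF-8."""
    count = 0
    for lead in range(0xA1, 0xFF):
        for trail in range(0xA1, 0xFF):
            if looks_like_utf8(bytes([lead, trail])):
                count += 1
    return count


if __name__ == "__main__":
    samples = {
        "UTF-8 Korean word": "한국".encode("utf-8"),
        "EUC-KR Korean word": "한국".encode("euc_kr"),
        "EUC-KR-range pair that is also valid UTF-8": b"\xc7\xb1",
    }
    for label, octets in samples.items():
        print(f"{label}: {looks_like_utf8(octets)}")
    print("ambiguous pairs:", count_ambiguous_euc_pairs(), "of", 94 * 94)
```

A pair passes only when its first octet is a UTF-8 two-octet lead
(0xC2-0xDF) and its second is a continuation octet (0x80-0xBF), which is
why a single two-octet word is the likeliest false positive and why
longer strings are rejected far more often.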
