Re: Detection of UTF-8 characters in perl.

Jungshik Shin Sat, 05 May 2001 15:40:29 -0700
On Fri, 4 May 2001, Daniel Resare wrote:

> On Fri, May 04, 2001 at 11:01:50AM +0100, Cameron wrote:

> > Essentially, i'm working on some iDNS stuff, and i'm looking for a nice easy
> > way to detect whether a string contains a utf8 character. I've looked
> > it doesn't seem to work on the variables i've passed from a cgi script. of
.......
> > do a reliable detection. An example of an input source that is probably not
> > long enough would be a search widget on a web page."

> I wrote a small utility that checks a string for UTF-8 validity a while ago
> and I found out that out of approximately 500k lines of varying charset
> that contained characters with 8th bit set (gnome translations) about
> 0.02% of the lines passed as UTF-8 that was not, and almost all of them
> were single words in korean. So, you can not be positively sure a given
> string really is UTF-8, but you can make a good guess.

  I'm curious how well your utility does when tested against
non-UTF-8 text (mostly single words or otherwise very short strings)
other than Korean (most likely in EUC-KR). Korean in EUC-KR, like text
in any other EUC encoding such as EUC-JP, EUC-CN, and EUC-TW (although
EUC-JP and EUC-TW use a few more octets to represent code sets
2 and 3, which are not used in EUC-CN and EUC-KR), uses a fairly limited
range of octets (either [0x20-0x7E] or two-octet sequences of [0xA1-0xFE]),
which, I'm afraid, could lead to a better test result than you would get
in a more general case.
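
  To make the octet-range point concrete, here is a minimal sketch in
Python (not Perl, and not Daniel's utility, which I have not seen). The
helper names looks_like_utf8 and count_euc_pairs_valid_as_utf8 are
made up for illustration. It treats "valid UTF-8" as "decodes under a
strict UTF-8 decoder" and counts how many isolated two-octet pairs in
[0xA1-0xFE] would pass such a check by accident.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def count_euc_pairs_valid_as_utf8() -> tuple[int, int]:
    """Count isolated two-octet pairs in [0xA1-0xFE] that are valid UTF-8."""
    octets = range(0xA1, 0xFF)  # 0xA1..0xFE inclusive
    total = passed = 0
    for lead in octets:
        for trail in octets:
            total += 1
            if looks_like_utf8(bytes([lead, trail])):
                passed += 1
    return passed, total


# "한글" in EUC-KR is C7 D1 B1 DB; 0xD1 is not a UTF-8 continuation octet.
print(looks_like_utf8("한글".encode("euc_kr")))  # False

# C7 A1 lies in the EUC two-octet range and is also valid UTF-8 (U+01E1).
print(looks_like_utf8(b"\xc7\xa1"))              # True

passed, total = count_euc_pairs_valid_as_utf8()
print(passed, total)                             # 930 8836
```

  Only pairs with a lead octet in [0xC2-0xDF] and a trail octet in
[0xA1-0xBF] pass, so a single isolated EUC pair is a false positive in
a bit over 10% of the possible octet combinations. Real text does not
use those combinations uniformly, so this is not a prediction of the
rate on actual data.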

  Could you try your utility against Shift-JIS, Big5, and some Russian
encodings (which are not ISO-2022 compliant)? The Gnome translation
packs should have translations in EUC-JP (if not in Shift-JIS), and you
can convert them to Shift-JIS with iconv to test your utility.
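
  As a rough sketch of that experiment, the snippet below does the
transcoding with Python's built-in codecs instead of iconv (something
like "iconv -f EUC-JP -t SHIFT_JIS" would be the command-line
counterpart). The sample strings and the helper names looks_like_utf8
and transcode are my own assumptions, and the strict-decoder check only
stands in for whatever the real utility does.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Re-encode data from charset src to charset dst, as iconv would."""
    return data.decode(src).encode(dst)


# A short EUC-JP sample, converted to Shift-JIS for testing.
sample_euc_jp = "日本語".encode("euc_jp")
sample_sjis = transcode(sample_euc_jp, "euc_jp", "shift_jis")

print(looks_like_utf8(sample_euc_jp))  # False
print(looks_like_utf8(sample_sjis))    # False

# Half-width katakana in Shift-JIS are single octets in [0xA1-0xDF].
# U+FF83 U+FF71 encode to C3 B1, which is also valid UTF-8 (U+00F1).
halfwidth = "\uff83\uff71".encode("shift_jis")
print(halfwidth == b"\xc3\xb1", looks_like_utf8(halfwidth))  # True True
```

  The last case is the kind of very short string that a validity check
alone cannot tell apart from real UTF-8.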

   Thank you,

   Jungshik Shin

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/