Re: detecting a UTF-8 string

Octavian Rasnita Wed, 03 Jan 2007 07:49:17 -0800

From: "Jay Savage" <[EMAIL PROTECTED]>

Try to unpack the data--or a chunk of data you feel is large enough to
be representative--with the pattern U0U*. If the unpack succeeds with
no warnings, you have valid utf8. You could try the same thing with
Encode's 'decode_utf8' routine. See perluniintro for details. In both
cases, though, you need to make sure that you've grabbed well-formed
utf8 from the source file in the first place. If the data cuts off in
the middle of a multi-byte character, you'll get an error.
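
A minimal sketch of the Encode-based check, under a few assumptions: it calls decode('UTF-8', ...) with the FB_CROAK check (the strict variant, rather than decode_utf8 itself), which dies on malformed input, and wraps the call in eval to get a yes/no answer. The helper name looks_like_utf8 and the sample strings are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if $bytes decodes cleanly as strict UTF-8.
sub looks_like_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode die on malformed input;
    # LEAVE_SRC keeps decode from modifying $bytes.
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    return eval { decode('UTF-8', $bytes, $check); 1 } ? 1 : 0;
}

# Sample byte strings (assumptions for illustration only).
my %samples = (
    'complete two-byte character' => "caf\xC3\xA9",
    'cut off mid-character'       => "caf\xC3",
    'Latin-1 byte'                => "caf\xE9",
);

for my $label (sort keys %samples) {
    my $verdict = looks_like_utf8($samples{$label}) ? 'valid' : 'not valid';
    print "$label: $verdict\n";
}
```

Pure ASCII data also passes this check, since ASCII is a subset of UTF-8, so a "valid" answer does not prove that the data was deliberately written as UTF-8.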


I have tried verifying the entire string, using the following:

my $result = unpack("U0U*", $content);
print $result;

Well, it gave no errors whether or not the string was UTF-8, but an interesting thing is that the result printed was always 65279 if the string was UTF-8 and 112 or 116 if the string was not UTF-8.

Do you know what these numbers represent? I am curious why it sometimes prints 112 and sometimes 116 for some ANSI strings. I hope the result is consistent, so that I can rely on it in my program to check whether a string is UTF-8.
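
One possible reading of these numbers, offered as a sketch rather than a confirmed diagnosis: unpack in scalar context returns only the first value it produces, so $result would hold a single number for the start of the string. 65279 is U+FEFF, the byte order mark (BOM), and 112 and 116 are the codes for 'p' and 't'. The sample string below is hypothetical, and the C* template stands in for U0U* only because its output does not depend on the Perl version.

```perl
use strict;
use warnings;

# The three reported values, shown as code points and characters.
printf "%d = U+%04X\n", 65279, 65279;
printf "%d = '%s'\n", $_, chr($_) for 112, 116;

# unpack in scalar context keeps only the first value it produces.
my $sample = "text";                  # hypothetical input
my $first  = unpack("C*", $sample);   # scalar context: first value only
my @all    = unpack("C*", $sample);   # list context: one value per byte

print "first: $first\n";
print "all:   @all\n";
```

If this reading is right, the printed value reflects only the first character of the data, and a UTF-8 file saved without a BOM would not yield 65279.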


Thank you.

Octavian


