Re: Perl & unicode weirdness.

Markus Kuhn Sat, 31 Jan 2004 07:10:00 -0800

Given the way Perl supports Unicode, you should hardly ever have to
call a UTF-8 encoder or decoder explicitly. You just have to make sure
that when a UTF-8 string enters Perl, it does so tagged as a UTF-8
string and not as an octet string. How that happens depends on how the
string gets into Perl. When opening files, for instance, you can tell
Perl which charset to expect, or tell it to look at the LC_CTYPE
locale.
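
A minimal sketch of the file case, assuming Perl 5.8 or later. The
scratch file name is hypothetical and exists only for this
illustration. The sketch writes two raw octets, reads them back
through an encoding layer so the string arrives as characters, and
shows an explicit decode for octets that come from some other source:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical scratch file, used only for this illustration.
my $path = "utf8-demo.txt";

# Write the UTF-8 encoding of U+00E9 (octets C3 A9) as raw bytes.
open(my $out, '>:raw', $path) or die "cannot write $path: $!";
print {$out} "\xC3\xA9\n";
close($out);

# Read it back through an :encoding(UTF-8) layer. Perl decodes on input,
# so the string holds one character rather than two octets.
open(my $in, '<:encoding(UTF-8)', $path) or die "cannot read $path: $!";
my $line = <$in>;
close($in);
chomp $line;
unlink $path;

# Octets obtained some other way can be decoded explicitly.
my $decoded = decode('UTF-8', "\xC3\xA9");

printf "from file: length %d, U+%04X\n", length($line), ord($line);
printf "decoded:   length %d, U+%04X\n", length($decoded), ord($decoded);
```

To follow the LC_CTYPE locale instead of naming the charset, the
"open" pragma accepts a ":locale" argument (use open ':locale';).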


Perl Unicode support before 5.8.0 was experimental, incomplete and in
practice not usable. Perl 5.8.0 worked pretty smoothly for me. In my
own use I discovered only a single UTF-8-related bug, to do with
regular expressions, and that was fixed in 5.8.1.

man perluniintro

I had a lot of Perl 5.0 scripts that processed UTF-8 before there was
any UTF-8 support in Perl. They continue to work with "use bytes;"
added, but they got significantly simpler by using Perl's Unicode
facilities.

Question: What is a quick way in Perl to get a regular expression that
matches all Unicode characters in the range U+0100..U+10FFFF, in other
words all non-ASCII Unicode characters?
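
One possible approach, offered as a sketch and not as a confirmed
answer, is a bracketed character class with \x{...} escapes. The range
as stated starts at U+0100, whereas non-ASCII strictly starts at
U+0080, so both variants are shown. The variable names are
illustrative only:

```perl
use strict;
use warnings;

# Code points U+0100..U+10FFFF, the range as literally stated.
my $above_latin1 = qr/[\x{100}-\x{10FFFF}]/;

# Everything outside ASCII, i.e. U+0080 and up.
my $non_ascii = qr/[^\x00-\x7F]/;

# Try a few sample code points against both classes.
for my $cp (0x41, 0xE9, 0x20AC, 0x1D11E) {
    my $ch = chr($cp);
    printf "U+%04X  above_latin1=%d  non_ascii=%d\n",
        $cp,
        ($ch =~ $above_latin1 ? 1 : 0),
        ($ch =~ $non_ascii    ? 1 : 0);
}
```

The two classes differ only for U+0080..U+00FF: U+00E9 matches the
second class but not the first.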

Markus

-- 
Markus Kuhn, Computer Lab, Univ of Cambridge, GB
http://www.cl.cam.ac.uk/~mgk25/ | __oo_O..O_oo__


--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
