Re: Perl & unicode weirdness.

Larry Wall Mon, 02 Feb 2004 13:16:01 -0800

On Sat, Jan 31, 2004 at 02:07:07PM +0000, Markus Kuhn wrote:
: Question: What is a quick way in Perl to get a regular expression that
: matches all Unicode characters in the range U0100..U10FFFF, in other
: words all non-ASCII Unicode characters?


Er, you mean U0080..U10FFFF perchance?  Or did you mean non-latin1?

These all seem to work for me in 5.8.1:

    print if /[^[:ascii:]]/;
    print if /[^\0-\177]/;
    print if /[\x{80}-\x{10ffff}]/;

Those will, of course, differ on their interpretation of characters
above U10FFFF.  Perl programmers are allowed to think bad thoughts like
that... Perl not only supports non-Unicode utf8-ish characters up to
2**32, but it even has an encoding for 64 bits (not sure if regexen
handle the latter, though).  Purists inclined to complain should note
that this is orthogonal to the acceptance of illegal characters in I/O.
(To avoid confusion, we don't call our encoding UTF-8.  We tend to
say UTF-8 when we mean UTF-8, and "utf8" when we mean the more general
not-necessarily-Unicode encoding.)
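
A rough sketch of how the three patterns part ways above U10FFFF,
assuming a reasonably recent perl and assuming the hex range is meant
to end at \x{10ffff}.  The helper name matches() and the sample labels
are made up for illustration, and utf8 warnings are switched off only
because the out-of-range code point is deliberate:

```perl
use strict;
use warnings;
no warnings 'utf8';    # the code point above U+10FFFF is deliberate

# The three patterns from the message; the hex range is assumed to end at U+10FFFF.
my @patterns = (
    [ posix_ascii => qr/[^[:ascii:]]/ ],
    [ octal_range => qr/[^\0-\177]/ ],
    [ hex_range   => qr/[\x{80}-\x{10ffff}]/ ],
);

# Hypothetical helper: returns a hash ref mapping pattern name to 1 (match) or 0.
sub matches {
    my ($str) = @_;
    my %result;
    for my $p (@patterns) {
        my ($name, $re) = @$p;
        $result{$name} = ($str =~ $re) ? 1 : 0;
    }
    return \%result;
}

my @samples = (
    [ 'ASCII only'     => "abc" ],
    [ 'U+00E9'         => chr(0xE9) ],
    [ 'U+0100'         => chr(0x100) ],
    [ 'above U+10FFFF' => chr(0x110000) ],    # not a Unicode code point
);

# Print one line per sample showing which patterns matched.
for my $s (@samples) {
    my ($label, $str) = @$s;
    my $m = matches($str);
    my @cells = map { my $n = $_->[0]; "$n=$m->{$n}" } @patterns;
    printf "%-15s %s\n", $label, join(' ', @cells);
}
```

With these samples the two negated classes should still match the code
point above U10FFFF, while the explicit hex range should not; all three
should agree on the other inputs.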

Larry

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
