On Mon, Feb 02, 2004 at 12:09:07PM -0800, Larry Wall wrote:
> On Sat, Jan 31, 2004 at 02:07:07PM +0000, Markus Kuhn wrote:
> : Question: What is a quick way in Perl to get a regular expression that
> : matches all Unicode characters in the range U0100..U10FFFF, in other
> : words all non-ASCII Unicode characters?
>
> Er, you mean U0080..U10FFFF perchance? Or did you mean non-latin1?
>
> These all seem to work for me in 5.8.1:
>
> print if /[^[:ascii:]]/;
> print if /[^\0-\177]/;
> print if /[\x{80}-\x{10ffff}]/;
>
> Those will, of course, differ on their interpretation of characters
> above U10FFFF. Perl programmers are allowed to think bad thoughts like
> that...Perl not only supports non-Unicode utf8-ish characters up to
> 2**32, but it even has an encoding for 64 bits (not sure if regexen
> handle the latter, though). Purists inclined to complain should note
> that this is orthogonal to the acceptance of illegal characters in I/O.
> (To avoid confusion, we don't call our encoding UTF-8. We tend to
> say UTF-8 when we mean UTF-8, and "utf8" when we mean the more general
What do you then call standard ISO 10646 UTF-8?
Is it utf-8 or utf8?
Should one use the names unicode-utf-8 and iso-utf-8?
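For anyone who wants to compare the three patterns quoted above, here is a minimal sketch. It is not from Larry's message: the names %pattern and matches_non_ascii are made up for illustration, and the upper bound of the third pattern is written as \x{10ffff}. It assumes a Perl 5.8 or later and has not been checked against 5.8.1 specifically.

```perl
use strict;
use warnings;
no warnings 'utf8';   # keep quiet about code points above U+10FFFF

# The three patterns from the quoted message, keyed by illustrative names.
my %pattern = (
    posix_class => qr/[^[:ascii:]]/,          # anything outside [:ascii:]
    octal_range => qr/[^\0-\177]/,            # anything outside 0..0177
    hex_range   => qr/[\x{80}-\x{10ffff}]/,   # explicit U+0080..U+10FFFF
);

# Return 1 if the named pattern matches somewhere in $string, else 0.
sub matches_non_ascii {
    my ($name, $string) = @_;
    return $string =~ $pattern{$name} ? 1 : 0;
}

# Probe a few single code points, including one beyond the Unicode range.
for my $cp (0x41, 0x7F, 0x80, 0xE9, 0x20AC, 0x10FFFF, 0x110000) {
    my @hits = grep { matches_non_ascii($_, chr $cp) } sort keys %pattern;
    printf "U+%06X: %s\n", $cp, @hits ? "@hits" : "(none)";
}
```

I would expect all three patterns to agree for U+0080 through U+10FFFF, and only the two negated classes to match 0x110000, which is the difference above U10FFFF that Larry mentions.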
best regards
keld
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/