On Mon, Feb 02, 2004 at 12:09:07PM -0800, Larry Wall wrote:
> On Sat, Jan 31, 2004 at 02:07:07PM +0000, Markus Kuhn wrote:
> : Question: What is a quick way in Perl to get a regular expression that
> : matches all Unicode characters in the range U0100..U10FFFF, in other
> : words all non-ASCII Unicode characters?
> 
> Er, you mean U0080..U10FFFF perchance?  Or did you mean non-latin1?
> 
> These all seem to work for me in 5.8.1:
> 
>     print if /[^[:ascii:]]/;
>     print if /[^\0-\177]/;
>     print if /[\x{80}-\x{10ffff}]/;
> 
> Those will, of course, differ on their interpretation of characters
> above U10FFFF. Perl programmers are allowed to think bad thoughts like
> that...Perl not only supports non-Unicode utf8-ish characters up to
> 2**32, but it even has an encoding for 64 bits (not sure if regexen
> handle the latter, though).  Purists inclined to complain should note
> that this is orthogonal to the acceptance of illegal characters in I/O.
> (To avoid confusion, we don't call our encoding UTF-8.  We tend to
> say UTF-8 when we mean UTF-8, and "utf8" when we mean the more general

What do you call standard ISO 10646 UTF-8, then?
Is that "utf-8" or "utf8"?

Should one use the names unicode-utf-8 and iso-utf-8?
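
To make the difference between the three quoted patterns concrete, here
is a minimal sketch that checks a few code points against each of them.
The helper name classify() and the labels posix_class, octal_range and
hex_range are my own, purely for illustration, and I am assuming the
third pattern was meant to end at \x{10ffff}. It also assumes a Perl
that accepts code points above U+10FFFF, as described above.

```perl
use strict;
use warnings;
no warnings 'utf8';   # keep quiet about code points beyond U+10FFFF

# The three patterns from the quoted message. The labels are illustrative,
# and the upper bound of the third is assumed to be U+10FFFF.
my %pattern = (
    posix_class => qr/[^[:ascii:]]/,
    octal_range => qr/[^\0-\177]/,
    hex_range   => qr/[\x{80}-\x{10ffff}]/,
);

# Return a hash ref mapping each label to 1 (match) or 0 (no match)
# for the single character with the given code point.
sub classify {
    my ($code_point) = @_;
    my $char = chr $code_point;
    my %result;
    for my $name (keys %pattern) {
        $result{$name} = $char =~ $pattern{$name} ? 1 : 0;
    }
    return \%result;
}

# One ASCII letter, the last ASCII code point, the first non-ASCII one,
# the last Unicode code point, and one just beyond the Unicode range.
for my $cp (0x41, 0x7F, 0x80, 0x10FFFF, 0x110000) {
    my $r = classify($cp);
    printf "U+%06X  posix_class=%d  octal_range=%d  hex_range=%d\n",
        $cp, @{$r}{qw(posix_class octal_range hex_range)};
}
```

I would expect the three to agree everywhere except on the last code
point, where only the explicit hex range should refuse to match.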

Best regards,
keld

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
