Re: regex & utf8

Dr.Ruud Sat, 12 May 2007 06:10:14 -0700

Tom Allison wrote:
> Ruud:
>> Tom:

>>> Under perl version 5.8, does /(\w+)/ match UTF-8 characters without
>>> calling any special pragma?
>>
>> Yes, but only if your data is proper. Mind that any ASCII-character
>> is a UTF-8 character too (U+0000 .. U+007F).
>


>>> So I'm trying to see if I can just use /(\w+)/ without worrying
>>> about all this character encoding?
>>
>> Only if your data is proper. A file is just a string of bytes. If you
>> use the proper IO-layer while reading in the file, then you'll end up
>> with proper data (a string of characters, not of bytes) to work with.
>>
>> A UTF-8 encoded file can't tell you that it is UTF-8 encoded. For
>> example a UTF-8 BOM at the start (as Windows Notepad uses) is not
>> proof. So you need to know beforehand.
>
> Rather than going through the somewhat buggy process of trying to
> determine which of the many character sets there are, is there some
> way that I can just universally convert everything into UTF8?

Yes, with iconv, but still only if you know the encoding of the
source; see my "beforehand" above.
See also `man iconv`.
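
If you would rather stay inside Perl, the Encode module can do the same
kind of conversion. A minimal sketch, assuming the source is known to be
ISO-8859-1 (the sample string is made up for illustration):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: these bytes are known beforehand to be ISO-8859-1 (Latin-1).
my $latin1_bytes = "caf\xE9";    # "cafe" with e-acute, as Latin-1 bytes

# Step 1: bytes -> characters, using the known source encoding.
my $chars = decode('ISO-8859-1', $latin1_bytes);

# Step 2: characters -> UTF-8 bytes.
my $utf8_bytes = encode('UTF-8', $chars);

# Show the resulting bytes in hex.
print join(' ', map { sprintf '%02X', ord } split //, $utf8_bytes), "\n";
# prints: 63 61 66 C3 A9
```

If the source encoding is stated wrongly, the conversion will usually
still "succeed" and quietly produce garbage, which is why you need to
know it beforehand.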


> I can open a file with a :utf8 declaration when creating the file
> handle.

Don't do that unless the contents of the file (that you open for
reading) are in utf8.
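
To see why the IO-layer matters for /(\w+)/, here is a small sketch that
reads the same file twice, once as raw bytes and once through a decoding
layer. The scratch file and its contents are made up for the example, and
it assumes no locale or unicode_strings feature is in effect:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the UTF-8 bytes for "naive" (with i-diaeresis) to a scratch file.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "na\xC3\xAFve\n";
close $out or die "close: $!";

# Read it as raw bytes: Perl sees 6 bytes plus the newline.
open my $raw, '<:raw', $path or die "open: $!";
my $bytes = <$raw>;
close $raw;

# Read it through the UTF-8 layer: Perl sees 5 characters plus the newline.
open my $dec, '<:encoding(UTF-8)', $path or die "open: $!";
my $chars = <$dec>;
close $dec;

my ($word_bytes) = $bytes =~ /(\w+)/;
my ($word_chars) = $chars =~ /(\w+)/;

# On the raw bytes the match stops early, before or inside the
# multi-byte sequence; on the decoded string it covers the whole word.
print "raw:     ", length($word_bytes), " units matched\n";
print "decoded: ", length($word_chars), " characters matched\n";
```

The `:encoding(UTF-8)` layer is used here rather than `:utf8` because it
also checks that the input is well-formed.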


> But do I need to do this on a utf8 file or will perl just
> "know".  If it doesn't, can I just open everything in utf8 mode and
> not lose any data?

No, see again my "beforehand". A utf8 file (I assume you mean "Perl
utf8") is still just a stream of bytes.
For guessing the encoding, see Encode::Guess:
http://search.cpan.org/~dankogai/Encode-2.21/lib/Encode/Guess.pm
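
What Perl can do for you is refuse data that is certainly not UTF-8. A
minimal sketch of a strict decode; the helper name decode_utf8_strict
and the sample strings are hypothetical:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns the decoded character string, or undef
# when the bytes are not well-formed UTF-8.
sub decode_utf8_strict {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input instead of
    # silently substituting U+FFFD; LEAVE_SRC keeps $bytes unmodified.
    my $chars = eval {
        Encode::decode('UTF-8', $bytes,
                       Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $chars;
}

my $good = "caf\xC3\xA9 au lait";    # well-formed UTF-8
my $bad  = "caf\xE9 au lait";        # Latin-1 bytes, malformed as UTF-8

for my $sample ($good, $bad) {
    my $decoded = decode_utf8_strict($sample);
    print defined $decoded ? "decodes as UTF-8\n" : "not valid UTF-8\n";
}
```

Mind that a successful decode is not proof either: it only tells you the
bytes could be UTF-8, not that they were meant to be.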

Did you read perlunitut?

-- 
Affijn, Ruud

"Gewoon is een tijger."

