Tom Allison wrote:
> Ruud:
>> Tom:

>>> Under perl version 5.8, does /(\w+)/ match UTF-8 characters without
>>> calling any special pragma?
>>
>> Yes, but only if your data is proper. Mind that any ASCII-character
>> is a UTF-8 character too (U+0000 .. U+007F).
>
>>> So I'm trying to see if I can just use /(\w+)/ without worrying
>>> about all this character encoding?
>>
>> Only if your data is proper. A file is just a string of bytes. If you
>> use the proper IO-layer while reading in the file, then you'll end up
>> with proper data (a string of characters, not of bytes) to work with.
>>
>> A UTF-8 encoded file can't tell you that it is UTF-8 encoded. For
>> example a UTF-8 BOM at the start (as Windows Notepad uses) is not
>> proof. So you need to know beforehand.
>
> Rather than going through the somewhat buggy process of trying to
> determine which of the many character sets there are, is there some
> way that I can just universally convert everything into UTF8?

Yes, iconv, but still only if you know the encoding of the source, see
my "beforehand". See also `man iconv`.

> I can open a file with a :utf8 declaration when creating the file
> handle.

Don't do that unless the contents of the file (that you open for
reading) are in utf8.

> But do I need to do this on a utf8 file or will perl just
> "know". If it doesn't, can I just open everything in utf8 mode and
> not lose any data?

No, see again my "beforehand". A utf8 file (I assume you mean "Perl
utf8") is still just a stream of bytes.

http://search.cpan.org/~dankogai/Encode-2.21/lib/Encode/Guess.pm

Did you read perlunitut?

-- 
Affijn, Ruud

"Gewoon is een tijger."
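To make the "bytes versus characters" point concrete, here is a minimal sketch. The sample strings are made up for illustration, and the guess that the second sample is really ISO-8859-1 is an assumption; in practice that knowledge has to come from outside the data. It uses only the core Encode module and an in-memory filehandle, so it needs no files on disk.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: the UTF-8 encoding of "café naïve", written as
# raw bytes so the example does not depend on this file's own encoding.
my $bytes = "caf\xC3\xA9 na\xC3\xAFve";

# Decode explicitly. FB_CROAK makes decode() die on malformed input
# instead of silently substituting U+FFFD; LEAVE_SRC keeps $bytes intact.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
my $chars = decode('UTF-8', $bytes, $check);

# 12 bytes on disk, 10 characters once decoded.
my ($n_bytes, $n_chars) = (length $bytes, length $chars);

# On the decoded string, \w sees whole characters.
my @words = $chars =~ /(\w+)/g;

# The same decoding done by an IO layer while reading. An in-memory
# filehandle stands in for a real file here.
open my $fh, '<:encoding(UTF-8)', \$bytes or die "open failed: $!";
my $line = <$fh>;
close $fh;

# Bytes in another encoding are not valid UTF-8, so treating everything
# as UTF-8 cannot work. Assumed here: this sample is ISO-8859-1.
my $latin1   = "caf\xE9 au lait";
my $is_utf8  = eval { decode('UTF-8', $latin1, $check); 1 } ? 1 : 0;
my $repaired = decode('ISO-8859-1', $latin1);

binmode STDOUT, ':encoding(UTF-8)';
print "$n_bytes bytes, $n_chars characters\n";
print "words: @words\n";
print "second sample is ", ($is_utf8 ? "" : "not "), "valid UTF-8\n";
```

The strict decode only tells you that the input is not UTF-8. It cannot tell you what the input actually is, which is why the source encoding still has to be known beforehand.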