I have just read the utf8(3pm) man page as it comes with perl v5.8.8 and, I'm afraid, I found it *very* confusing and well below the generally very high standards of clarity found in most of the Perl documentation.
There is a wild mixture of terminology that is never properly defined anywhere. For example, no clear distinction is made whether "a string is in UTF-8" means that the UTF-8 flag has been set (character semantics versus byte semantics), or whether the string's internal representation does not contain any malformed UTF-8 byte sequences, or both, or neither.

Basically, I have not understood without great doubt and uncertainty what any of the "utility functions" described really do, that is, whether they only affect the byte/character flag of a string or whether (and under which conditions exactly) they also change the byte sequence itself.

There are a number of applications in which a Perl developer is continuously dealing with a mixture of both byte and character sequences, and these will not go away. Think about binary file formats or machine code (byte sequences) that contain embedded UTF-8 strings (character sequences) that each need to be treated as such, but that also need to be concatenated or separated in various ways. Or think about Perl code that robustly searches for and prints diagnostics about malformed UTF-8 sequences.

In such applications, the low-level control over the byte-versus-character nature of a Perl string that the utf8:: functions provide is extremely important, and a clearer writeup of what exactly they do would be very helpful. Given how important these functions are for such applications, the many references to "this may change in the future" are also adding a lot of fear, uncertainty and doubt to anyone who wants to use them. :-(

Example:

    Utility functions

    The following functions are defined in the "utf8::" package by the Perl core. You do not need to say "use utf8" to use these and in fact you should not say that unless you really want to have UTF-8 source code.

    * $num_octets = utf8::upgrade($string)

      Converts in-place the octet sequence in the native encoding (Latin-1 or EBCDIC) to the equivalent character sequence in UTF-X. [What exactly is meant by "native encoding" these days?] $string already encoded as characters does no harm. [What does "no harm" mean exactly?] Returns the number of octets necessary to represent the string as UTF-X. [Examples of all the major cases how this function can behave?] Can be used to make sure that the UTF-8 flag is on [is that all it does?], so that "\w" or "lc()" work as Unicode [?] on strings containing [UTF-8?] characters in the range 0x80-0xFF (on ASCII and derivatives [?]).

      Note that this function does not handle arbitrary encodings. [Which cases does it handle?] Therefore Encode.pm is recommended for the general purposes. [Example?]

      Affected by the encoding pragma. [How?]

    * $success = utf8::downgrade($string[, FAIL_OK])

      Converts in-place the character sequence in UTF-X to the equivalent octet sequence in the native encoding (Latin-1 or EBCDIC). $string already encoded as octets does no harm. Returns true on success. On failure dies or, if the value of "FAIL_OK" is true, returns false. Can be used to make sure that the UTF-8 flag is off, e.g. when you want to make sure that the substr() or length() function works with the usually faster byte algorithm.

      Note that this function does not handle arbitrary encodings. Therefore Encode.pm is recommended for the general purposes. [Same problems as above]

      Not affected by the encoding pragma.

      NOTE: this function is experimental and may change or be removed without notice. [:-(]

    * utf8::encode($string)

      Converts in-place the character sequence to the corresponding octet sequence in UTF-X. The UTF-8 flag is turned off. Returns nothing. [Does this mean that the byte sequence is never touched and all this function does is to turn off the UTF-8 flag?]

      Note that this function does not handle arbitrary encodings. Therefore Encode.pm is recommended for the general purposes [?].

    * utf8::decode($string)

      Attempts to convert in-place the octet sequence in UTF-X to the corresponding character sequence. The UTF-8 flag is turned on only if the source string contains multiple-byte UTF-X characters. If $string is invalid as UTF-X, returns false; otherwise returns true.

      Note that this function does not handle arbitrary encodings. Therefore Encode.pm is recommended for the general purposes.

      NOTE: this function is experimental and may change or be removed without notice. [:-( why?]

    * $flag = utf8::is_utf8(STRING)

      (Since Perl 5.8.1) Test whether STRING is in UTF-8. Functionally the same as Encode::is_utf8(). [Does this just return the UTF-8 flag, or does it test the string, and if the latter, against what exact regexp?]

    * $flag = utf8::valid(STRING)

      [INTERNAL] Test whether STRING is in a consistent state regarding UTF-8. [What exactly does this mean?] Will return true is [sic!] well-formed UTF-8 and has the UTF-8 flag on or if string is held as bytes (both these states are 'consistent'). Main reason for this routine is to allow Perl's test-suite to check that operations have left strings in a consistent state. You most probably want to use utf8::is_utf8() instead.

    "utf8::encode" is like "utf8::upgrade", but the UTF8 flag is cleared. [Also, one requires a character sequence, the other an octet sequence!] See perlunicode for more on the UTF8 flag and the C API functions "sv_utf8_upgrade", "sv_utf8_downgrade", "sv_utf8_encode", and "sv_utf8_decode", which are wrapped by the Perl functions "utf8::upgrade", "utf8::downgrade", "utf8::encode" and "utf8::decode". Note that in the Perl 5.8.0 and 5.8.1 implementation the functions utf8::is_utf8, utf8::valid, utf8::encode, utf8::decode, utf8::upgrade, and utf8::downgrade are always available, without a "require utf8" statement -- this may change in future releases.

It would be great if there were an expert here who really understands this API and who could clarify the writing somewhat. Some other parts of the Perl Unicode documentation are also not yet shining examples of clear writing.

Thanks!
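P.S. To make the flag-versus-bytes question concrete, the kind of experiment I have in mind can be sketched roughly as follows. This is only a sketch under assumptions: describe() is a hypothetical helper of my own (not part of utf8::), the platform is assumed to be ASCII-based rather than EBCDIC, and "use bytes" is used purely to peek at the length of the internal buffer. Whatever it prints tells you what one particular perl binary does, not what the documentation guarantees, which is exactly the gap I am complaining about.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of utf8::): report the UTF-8 flag, the
# length in characters, and the length of the internal buffer in bytes.
sub describe {
    my ($s) = @_;
    my $flag  = utf8::is_utf8($s) ? 1 : 0;
    my $chars = length $s;
    my $bytes = do { use bytes; length $s };   # peek at internal representation
    return "flag=$flag chars=$chars bytes=$bytes";
}

my $s = "\xE9";                     # one character in the range 0x80-0xFF
my @report = ("initial:    " . describe($s));

my $octets = utf8::upgrade($s);     # does the character count stay the same?
push @report, "upgrade:    " . describe($s) . " returned=$octets";

my $ok = utf8::downgrade($s);       # should undo the upgrade
push @report, "downgrade:  " . describe($s);

utf8::encode($s);                   # does the string value itself change?
push @report, "encode:     " . describe($s) . " ords="
    . join(",", map { sprintf("%02X", ord($_)) } split(//, $s));

my $decoded = utf8::decode($s);     # should be the inverse of encode
push @report, "decode:     " . describe($s);

my $bad    = "\xE9";                # a lone 0xE9 is not well-formed UTF-8
my $bad_ok = utf8::decode($bad) ? 1 : 0;
push @report, "decode bad: " . describe($bad) . " returned=$bad_ok";

print "$_\n" for @report;
```

Comparing the "chars" and "bytes" columns after each call is what distinguishes "only the internal representation changed" (upgrade/downgrade) from "the string value itself changed" (encode/decode). A table of such cases in the man page would answer most of my questions above.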
Markus

-- 
Markus Kuhn, Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/~mgk25/ || CB3 0FD, Great Britain