Thanks to both David E. Wheeler and David Graff for their answers.
I have now uploaded my first distro that takes advantage of them, Locale-KeyedText-1.01_1.tar.gz, which requires 5.008001.
All files in my distro are officially UTF-8, without a BOM, and with unix line breaks; this is identical to ASCII except where non-ASCII characters are used. All code files "use utf8" and all files having pod use "=encoding utf8".
I tested this distro and it passes its test suite fine under both Perl 5.8.1RC3 and Perl 5.8.6. However, under both versions the 'make' stage gives the following error when manifying the POD:
Manifying blib/man3/Locale::KeyedText.3
lib/Locale/KeyedText.pm:10: Unknown command paragraph "=encoding utf8"
Now, the line that is being flagged is identical to both the online perlpod documentation and the example given by David Wheeler. So what is the problem here?
Besides CPAN, the file is also available here: http://darrenduncan.net/d/perl/Locale-KeyedText-1.01_1.tar.gz
Thanks for any feedback on the error or other matters.
-- Darren Duncan
At 11:07 PM -0800 12/12/04, David E. Wheeler wrote (on-list):
On Dec 12, 2004, at 10:06 PM, Darren Duncan wrote:
What I would like to do is create my CPAN module distributions such that all of the files in each distro, code and documentation and tests and logs alike, are properly UTF-8 encoded, and do this in such a way that no modern Perl distributions or the automated CPAN tools will break.
Short answer:
use utf8;

=pod

=encoding utf8

=cut
Regards,
David
At 1:20 PM -0500 12/13/04, David Graff wrote (off-list):
[EMAIL PROTECTED] said:
0. For my main question, is distribution as Unicode files a good idea at all currently, though few if any people do it?
It's a good idea if you need to include text/character data that fall outside the ASCII range (e.g. pod in languages other than English, etc). Otherwise, since ASCII is a proper subset of utf8, every ASCII-only distro is, by definition, a utf8 distro.
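As a rough illustration of the subset point, a file whose octets are all in the 7-bit range is already valid UTF-8 as it stands. The helper name is_ascii_only below is made up for this sketch and is not part of any module discussed in this thread:

```perl
use strict;
use warnings;

# Hypothetical helper: true if the raw octets are all 7-bit ASCII,
# in which case the data is also valid UTF-8 without any conversion.
sub is_ascii_only {
    my ($octets) = @_;
    return $octets !~ /[^\x00-\x7F]/;
}

# The second string holds the two UTF-8 octets of e-acute (C3 A9).
for my $sample ("package Foo;\n", "caf\xC3\xA9\n") {
    print is_ascii_only($sample) ? "ASCII only\n" : "has non-ASCII octets\n";
}
```

A check like this only says whether a file needs any special handling; it does not validate UTF-8 in files that do contain high octets.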
1. BBEdit gives me an option to have a byte-order mark in UTF-8 files (that happens to be 3 octets long I think), with the recommendation being to use it; I also have the choice not to, which makes the file more similar to many other ASCII-like encodings. So should I save the files with the BOM or without?
The BOM comes in very handy for UTF-16 data, and I suppose there may be some apps that will check the top of a file for a BOM (as LE, BE, or the three-byte utf8 pattern) in order to "predict" that the file contains the corresponding sort of unicode data. But Perl is not one of those apps, nor are any of the tools that are normally used to install Perl modules. Since you're not ever using UTF-16, and module recipients won't be either, the BOM will just get in the way. Leave it out.
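If an editor has already written the three-octet UTF-8 BOM (EF BB BF) into a file, it can be dropped before packaging. A minimal sketch, assuming the file content has been read as raw octets; strip_utf8_bom is a name invented for this example:

```perl
use strict;
use warnings;

# Hypothetical helper: remove a leading UTF-8 BOM (octets EF BB BF), if any.
# Input without a BOM is returned unchanged.
sub strip_utf8_bom {
    my ($octets) = @_;
    $octets =~ s/\A\xEF\xBB\xBF//;
    return $octets;
}

my $with_bom = "\xEF\xBB\xBF#!perl\nuse utf8;\n";
my $cleaned  = strip_utf8_bom($with_bom);
print "starts with shebang\n" if $cleaned =~ /\A#!/;
```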
2. I am given a separate option to use either Unicode linebreaks or one of Unix/Mac/Win; all 4 are given as options to use with a Unicode encoding. In my own tests, Perl 5.8 complained when the Unicode line break was used with UTF8, but not the Unix line break (I was not, however, using any special pragmas). So should I use the Unicode linebreak or the Unix linebreak, assuming the former can be made to work?
2.1 Will the addition of "use utf8" on the first line of a Perl file cause Perl to accept files with Unicode line breaks?
I can't imagine what a "unicode linebreak" in utf8 would be. Does BBEdit indicate a code point for this? In any case, I'd stick with the unix line breaks (simply \xA), because all the tools normally used to install modules will recognize and handle this correctly.
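For files that arrive with Mac (CR) or Windows (CRLF) line breaks, converting to unix line breaks is a one-line substitution. This is only a sketch with an invented helper name, and it deliberately does not handle whatever BBEdit means by a "Unicode" line break, since this thread does not establish which code point that is:

```perl
use strict;
use warnings;

# Hypothetical helper: turn CRLF and lone CR into a single LF (\x0A).
# Existing LF-only line breaks are left as they are.
sub to_unix_linebreaks {
    my ($octets) = @_;
    $octets =~ s/\x0D\x0A?/\x0A/g;
    return $octets;
}

my $mixed = "line one\x0D\x0Aline two\x0Dline three\x0A";
my $unix  = to_unix_linebreaks($mixed);
print "no CR left\n" unless $unix =~ /\x0D/;
```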
3. Can a "use utf8" be put anywhere besides the first line of a file? What if I customarily put POD on the first few lines and the package declaration beneath it? Also, in a script file, which goes first, the #!perl or the use-utf8?
The shebang line goes first, always -- in fact, this is a good reason to forget about including the BOM in your distro files. For unix shells to use the shebang line properly, the two characters "#!" must be the first ones in the file.
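A possible layout for a script file, shown only as a sketch: the shebang line first, then the pragma. The utf8 pragma is lexically scoped, so it affects the source text that follows it in the file; placing it directly after the shebang is a choice made for this example, not something the answer above prescribes. The file is assumed to be saved as UTF-8 without a BOM.

```perl
#!perl
use utf8;      # source below this line is read as UTF-8
use strict;
use warnings;

# With "use utf8" in effect this literal is four characters,
# not the five octets it occupies on disk.
my $word = "café";
print length($word), "\n";
```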
4. What about plain POD files? Since they contain no Perl code, will POD extractors know what to do since I can't put the use-utf8 in them?
The interpretation of characters in pod will depend on the display mechanism being used by the person who runs perldoc. If the pod includes utf8 text, and the person runs perldoc in a utf8-capable window (with the appropriate font(s) available), everything should go just fine, maybe.
Best to just try it out and see what happens. Results might depend on environmental things like locale setting, etc. Since there are tools that will convert pod to html, etc, it would be worthwhile to see how these work when the pod contains utf8. Again, try it and see.
5. Would the CPAN compare utility adapt to encoding changes, or would it consider an otherwise-identical file with different encodings to consist of one very large change?
Who said anything about having the same module posted on CPAN with different encodings? Why do that? I wouldn't expect CPAN diff tools to handle this sort of case by trying to factor out encoding differences -- if there ever is a good reason to post the same module with different encodings, then it's likely the different versions should be treated as different.
6. In general, would anything on CPAN break? What about the automated testers?
7. Are there any other common issues that I should be aware of, and if so then what?
In general, CPAN is just a repository of tar files. What's to break? Try it out and see.
David Graff