POD as Latin-1 or Unicode

Sean M. Burke Tue, 17 Apr 2001 13:35:27 -0700
BTW, sorry about the previous post going out twice -- my mailer freaked in
the middle of its SMTP session.


Here's a problem I've run up against in testing Pod::PXML by round-tripping
existing POD: what character-encoding is POD in?

If we say that all POD is in UTF8 (and interpret it all as such), then a
few POD bits I've found that have Latin-1 characters in them (as literal
é's, not as E<...> things) cause the screaming of bloody murder ("malformed
UTF8 character...").
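
To make that failure concrete, here is a minimal sketch in Python (used only for illustration; none of this is Pod::PXML code). It assumes the offending POD stores é as the single Latin-1 byte 0xE9, and shows that a strict UTF-8 decoder rejects it.

```python
# A literal Latin-1 e-acute is the single byte 0xE9.
raw = "café".encode("latin-1")

# In UTF-8, 0xE9 announces a three-byte sequence, but no
# continuation bytes follow, so a strict decoder refuses it.
try:
    raw.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
```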

If we say that all POD is in Latin-1 (and interpret it all as such), then
it's insufficient for expressing Perl code, since Perl code can contain
Unicode characters.  So if you define a sub whose name is a string of three
Han glyphs, then you just can't show it off in a verbatim block.  (Outside
of verbatim blocks this isn't a problem, since we can just say "let's all
use E<...>'s for non-US-ASCII characters".)


What I'm leaning toward is to assume that all POD is in /either/ UTF8 or
Latin-1 (or US-ASCII, in which case the difference is moot), and that one
should start out treating it as Latin-1, scan all clusters of
([\x80-\xFF]+) to see if they look like UTF8, and if they all do, then put
the whammy on the text so that it's magically to be considered thenceforth
as Unicode.
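
The whole-document heuristic described above might be sketched as follows. This is Python for illustration only, under stated assumptions: the function names decode_pod and looks_like_utf8 are hypothetical, and "looks like UTF8" is taken to mean "a strict UTF-8 decoder accepts the cluster".

```python
import re

# Runs of one or more high-bit bytes, as in ([\x80-\xFF]+).
HIGH_BIT_RUN = re.compile(rb"[\x80-\xFF]+")


def looks_like_utf8(run: bytes) -> bool:
    """True if this cluster of high-bit bytes is well-formed UTF-8."""
    try:
        run.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def decode_pod(raw: bytes) -> str:
    """Treat the document as UTF-8 only if every cluster qualifies."""
    runs = HIGH_BIT_RUN.findall(raw)
    if all(looks_like_utf8(r) for r in runs):
        # Also covers pure US-ASCII input, where there are no runs.
        return raw.decode("utf-8")
    # At least one cluster is not UTF-8: fall back to Latin-1.
    return raw.decode("latin-1")
```

One caveat with any such guess: some Latin-1 text is also well-formed UTF-8. The two Latin-1 characters "Ã©" are the bytes 0xC3 0xA9, which is exactly the UTF-8 encoding of é, so a document containing only that cluster would be read as UTF-8.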

Now, I can't believe I'm the first person to have faced this problem, since
this is hardly POD-specific.  But a scan of relevant CPAN/core modules just
leaves me with a screaming headache because of all the levels of
representation and the apparent mix of things predating and postdating
utf8.pm.  And I suppose I'm looking for something that plays nice with utf8.
Is there some magical incantation for doing what I mean?


(BTW, there's an alternative to construing a /whole/ document as being
either in UTF8 or in something bytewise: consider every individual bunch of
high-bit chars independently: if a cluster is valid utf8, construe it as
such, otherwise construe it as Latin-1 or the like.
However, I don't want to be the one to suggest letting /that/ particular
genie out of the bottle!)
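
For comparison, the per-cluster alternative could look like the sketch below (again Python for illustration, with hypothetical names). Each run of high-bit bytes is decided on its own, so one document can end up mixing both interpretations, which is the genie in question.

```python
import re

# Runs of one or more high-bit bytes.
HIGH_BIT_RUN = re.compile(rb"[\x80-\xFF]+")


def decode_cluster(run: bytes) -> str:
    """Decode one cluster as UTF-8 if it is valid, else as Latin-1."""
    try:
        return run.decode("utf-8")
    except UnicodeDecodeError:
        return run.decode("latin-1")


def decode_pod_per_cluster(raw: bytes) -> str:
    """Decode each high-bit cluster independently of the others."""
    parts = []
    pos = 0
    for m in HIGH_BIT_RUN.finditer(raw):
        # Bytes between clusters are all below 0x80, i.e. US-ASCII.
        parts.append(raw[pos:m.start()].decode("ascii"))
        parts.append(decode_cluster(m.group()))
        pos = m.end()
    parts.append(raw[pos:].decode("ascii"))
    return "".join(parts)
```

Given the bytes 0xC3 0xA9, then " and ", then 0xE9, this yields "é and é": the first cluster is taken as UTF-8 and the second as Latin-1.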


Now, there's a converse problem when emitting text as POD: encode it as
UTF8, or as Latin-1?  (Remember, the difference currently arises /only/
when there's high-bit content in verbatim blocks -- everywhere else, you
can use an E<...>.)
I'm leaning toward "verbatims always come out in UTF8", for sake of
uniformity.  Be permissive in what you accept, strict in what you emit,
etc. etc.
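
The emitting side might be sketched like this (Python for illustration; emit_pod_paragraph is a hypothetical name). It assumes numeric escapes of the form E<233> are acceptable outside verbatim blocks; named escapes such as E<eacute> would be the other option.

```python
def emit_pod_paragraph(text: str, verbatim: bool) -> bytes:
    """Serialize one paragraph of POD text to bytes."""
    if verbatim:
        # Verbatim blocks cannot use E<...>, so they always
        # come out as UTF-8.
        return text.encode("utf-8")
    # Everywhere else, escape each non-US-ASCII character as a
    # numeric E<...> code so the output stays pure US-ASCII.
    # (Assumption: decimal code points are accepted in E<...>.)
    escaped = "".join(
        ch if ord(ch) < 0x80 else f"E<{ord(ch)}>" for ch in text
    )
    return escaped.encode("ascii")
```

With this scheme an ordinary paragraph never contains high-bit bytes at all, so the Latin-1 versus UTF8 question only ever arises for verbatim output.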


--
Sean M. Burke  [EMAIL PROTECTED]  http://www.spinn.net/~sburke/