At 23:58 2002-11-13 -0500, Benjamin Goldberg wrote:
Here's an important question: perldoc perlpodspec says that the encoding is either latin1 if there's no BOM, or UTF-(size) if there is a BOM. What if we want a file with a different encoding? In particular, people writing text with accented characters often want latin1 or latin2, since that may be what their editor supports.
Actually, I said:

<<
Since Perl recognizes a Unicode Byte Order Mark at the start of files as signaling that the file is Unicode encoded as in UTF-16 (whether big-endian or little-endian) or UTF-8, Pod parsers should do the same. Otherwise, the character encoding should be understood as being UTF-8 if the first highbit byte sequence in the file seems valid as a UTF-8 sequence, or otherwise as Latin-1.

Future versions of this specification may specify how Pod can accept other encodings. Presumably treatment of other encodings in Pod parsing would be as in XML parsing: whatever the encoding declared by a particular Pod file, content is to be stored in memory as Unicode characters.
>>


Two important facts: these are "should", not "must"; and, probably less importantly, in the absence of a BOM, the first highbit sequence should trigger auto-guessing.
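To make the guessing rule above concrete, here is a minimal sketch of it in Python. It is not code from perlpodspec or from any Pod parser. The function name `guess_pod_encoding` is hypothetical, the returned strings are Python codec names, and it assumes that "first highbit byte sequence" means the first maximal run of bytes in the range 0x80-0xFF.

```python
import codecs
import re

# Byte Order Marks named in the quoted text (UTF-8 and both UTF-16 orders),
# paired with Python codec names that consume the BOM while decoding.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_BE, "utf-16"),
    (codecs.BOM_UTF16_LE, "utf-16"),
]

# Assumption: a "highbit byte sequence" is a maximal run of bytes >= 0x80.
_HIGHBIT_RUN = re.compile(rb"[\x80-\xff]+")


def guess_pod_encoding(data: bytes) -> str:
    """Guess the encoding of raw file bytes using the rule quoted above."""
    # 1. A BOM at the very start of the file settles the question.
    for bom, codec_name in _BOMS:
        if data.startswith(bom):
            return codec_name

    # 2. No BOM: look at the first run of high-bit bytes, if there is one.
    match = _HIGHBIT_RUN.search(data)
    if match is None:
        # Pure ASCII decodes identically as UTF-8 or Latin-1, so the
        # "otherwise" branch of the rule is harmless here.
        return "latin-1"

    # 3. If that run is well-formed UTF-8, guess UTF-8; otherwise Latin-1.
    try:
        match.group().decode("utf-8")
    except UnicodeDecodeError:
        return "latin-1"
    return "utf-8"
```

A caller would then decode with `data.decode(guess_pod_encoding(data))`, so that whatever the file's encoding, the content ends up in memory as Unicode characters. Since the rule is a "should" and a guess, Latin-1 text whose first high-bit run happens to form valid UTF-8 will be misidentified.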

But anyway, I made the above a bit nebulous because:
-- I was waiting for something like Encode
-- I was waiting for perl to support (and auto-guess) encodings on non-binary filehandles AND perl source.
-- And, more optimistically, I was hoping that people who wanted to edit Pod and/or perl source in East Kreplakhistani or whatever doesn't fit in Latin-1, would do it in Unicode.

Have at least two of these things happened yet?

I have an ill feeling about implementing this stuff in Pod::Simple, since there's very little that's Pod-specific about wanting text/source to be parsed as being in the correct encoding, especially as signaled by a BOM.

--
Sean M. Burke http://search.cpan.org/author/sburke/


