Re: Non-ASCII data in POD
Grant McLean gr...@mclean.net.nz writes:

> OK, so I went ahead and implemented both the warning and the heuristic
> to guess Latin-1 vs UTF-8 (only when no encoding was specified). The
> resulting patch is here: https://github.com/theory/pod-simple/pull/26

This patch requires authors to add an =encoding UTF-8 line to specify
that the doc is, indeed, UTF-8 encoded. Wouldn't it be far better to
consider all POD documents to be UTF-8 encoded Unicode and fall back to
Latin-1 if invalid UTF-8 sequences are detected? In other words, do not
require the author to add =encoding UTF-8, since that would be the
default, and only add =encoding ISO8859-1 for Latin-1 encoded documents?

Since most POD documents currently are ASCII, they won't be affected.
POD docs that are Latin-1 or something similar must get an explicit
encoding line added; these are precisely the documents affected by your
patch.

-- Johan
Re: Non-ASCII data in POD
On Mon, 2012-04-30 at 14:24 +0200, Johan Vromans wrote:
> Grant McLean gr...@mclean.net.nz writes:
> > OK, so I went ahead and implemented both the warning and the
> > heuristic to guess Latin-1 vs UTF-8 (only when no encoding was
> > specified). The resulting patch is here:
> > https://github.com/theory/pod-simple/pull/26
>
> This patch enforces authors to add an =encoding UTF-8 line to specify
> that the doc is, indeed, UTF-8 encoded.

Not exactly. It generates a warning during the parsing process, which
will be visible in the output of any formatter that has error output
enabled. It's not a fatal error, so it doesn't exactly enforce anything.

The aim is to help people comply with the spec for POD as it is
currently written. And that spec says that if there are non-ASCII
characters, there must be an =encoding declaration.

> Wouldn't it be far better to consider all POD documents to be UTF-8
> encoded Unicode and fall back to Latin-1 if invalid UTF-8 sequences
> are detected?

You won't get any argument from me that UTF-8 would be a better default,
but that's not how the spec is currently written. If your Perl source
code includes UTF-8 characters, you must say:

    use utf8;

If your POD includes UTF-8 characters, you must say:

    =encoding utf8

> In other words, do not enforce the author to add =encoding UTF-8 since
> that's the default? And only add =encoding ISO8859-1 for Latin1
> encoded documents?

The patch also implements the heuristic recommended in perlpodspec,
which has the effect of allowing either Latin-1 or UTF-8 to work (the
default is ASCII) in spite of the missing declaration. This will be a
win for sites like metacpan.org, which currently don't display UTF-8
correctly from POD that lacks an =encoding declaration.

Any formatter that has error display disabled will see better rendering
of UTF-8 with this patch. Additionally, if errors are displayed, the
non-compliance with perlpodspec will be reported.

Regards
Grant
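[Editor's note: the two declarations Grant mentions can be sketched side
by side in one minimal, self-contained file. This is an illustration
only; the file text, POD wording, and the `$word` variable are made up.]

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;          # declares that the Perl source below contains UTF-8

=encoding utf8

=head1 NAME

example - this POD contains a non-ASCII word (café), so perlpodspec
requires the C<=encoding utf8> line above

=cut

binmode STDOUT, ':encoding(UTF-8)';   # avoid "wide character" warnings
my $word = 'café';                    # non-ASCII literal, decoded via use utf8
print "$word\n";
```

Without `use utf8` the literal would be treated as raw bytes, and
without `=encoding utf8` a spec-compliant POD formatter would assume
the POD is ASCII.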
Re: Non-ASCII data in POD
On Apr 27, 2012, at 12:10 AM, Grant McLean wrote:
> OK, so I went ahead and implemented both the warning and the heuristic
> to guess Latin-1 vs UTF-8 (only when no encoding was specified). The
> resulting patch is here: https://github.com/theory/pod-simple/pull/26

I like this, but wonder if maybe it shouldn't be consistent? That is, if
you see more than one of these in a single document, and one can be
output as UTF-8 and the other can't, would the resulting output have
mixed encodings? IOW, should it not perhaps use the encoding it
determined for the first one of these it finds in a document?

Best,

David
Re: Non-ASCII data in POD
On Fri, 2012-04-27 at 09:17 -0700, David E. Wheeler wrote:
> On Apr 27, 2012, at 12:10 AM, Grant McLean wrote:
> > OK, so I went ahead and implemented both the warning and the
> > heuristic to guess Latin-1 vs UTF-8 (only when no encoding was
> > specified). The resulting patch is here:
> > https://github.com/theory/pod-simple/pull/26
>
> I like this, but wonder if maybe it shouldn't be consistent? That is,
> if you see more than one of these in a single document, and one can be
> output as UTF-8 and the other can't, would the resulting output have
> mixed encodings? IOW, should it not perhaps use the encoding it
> determined for the first one of these it finds in a document?

I'm not sure I quite understand what you're saying. The first time a
non-ASCII byte is encountered, the code will 'fire' and apply the
heuristic to set an encoding. Once the encoding is set, the code won't
be called again.

The perlpodspec seems pretty clear that a POD document containing
different encodings should be considered an error.

Regards
Grant
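[Editor's note: the fire-once guessing step Grant describes can be
sketched as a standalone function. The `guess_encoding` helper below is
hypothetical, written for illustration, and is not the actual
Pod::Simple internals.]

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Sketch of the perlpodspec heuristic: when the first high-bit byte
# sequence is seen, treat the document as UTF-8 if that sequence
# decodes cleanly as UTF-8, and as Latin-1 otherwise.
sub guess_encoding {
    my ($line) = @_;
    return eval { decode('UTF-8', $line, FB_CROAK); 1 }
        ? 'UTF-8'
        : 'ISO-8859-1';
}

print guess_encoding("caf\xC3\xA9"), "\n";   # valid UTF-8 sequence
print guess_encoding("caf\xE9"), "\n";       # bare 0xE9 is not valid UTF-8
```

Once the guess has been made for the first high-bit sequence, the parser
keeps that encoding for the rest of the document, which is why mixed
encodings are an error rather than something to paper over.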
Re: Non-ASCII data in POD
On Apr 27, 2012, at 12:54 PM, Grant McLean wrote:
> I'm not sure I quite understand what you're saying. The first time a
> non-ASCII byte is encountered, the code will 'fire' and apply the
> heuristic to set an encoding. Once the encoding is set, the code won't
> be called again.

Oh, perfect. I missed that.

> The perlpodspec seems pretty clear that a POD document containing
> different encodings should be considered an error.

As it should be.

David
Re: Non-ASCII data in POD
On 04/25/2012 09:25 PM, Russ Allbery wrote:
> Grant McLean gr...@mclean.net.nz writes:
> > My thoughts on the second issue are that we could modify Pod::Simple
> > to 'whine' if it sees non-ASCII bytes but no =encoding. This in turn
> > would cause Test::Pod to pick up the error and help people fix it.
>
> I would be in favor of that.

FYI, this test is already in the checks that are run on the PODs that
are included with the Perl core.
Non-ASCII data in POD
Hi POD people,

There's been a discussion on #metacpan about non-ASCII characters in POD
being rendered incorrectly on the metacpan.org web site. The short story
is that some people use UTF-8 characters without including:

    =encoding utf8

Apparently the metacpan tool chain assumes Latin-1 encoding, but with
the right encoding declaration, the characters would be rendered
correctly.

The latest perlpodspec seems to imply an ASCII default, and anything
else should have an =encoding. In the implementation notes section it
also suggests a heuristic of checking whether the first high-bit
byte-sequence is valid as UTF-8, and defaulting to UTF-8 if so and
Latin-1 otherwise.

This raises two issues:

1) Pod::Simple (as used by metacpan) does not seem to implement this
   heuristic

2) We need to educate people who are not aware of the =encoding command

My thoughts on the second issue are that we could modify Pod::Simple to
'whine' if it sees non-ASCII bytes but no =encoding. This in turn would
cause Test::Pod to pick up the error and help people fix it.

I'd be happy to look at implementing both these things if it's agreed
they're a good idea.

Regards
Grant
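[Editor's note: the proposed 'whine' surfaces through Pod::Simple's
existing errata interface. A minimal sketch, assuming a Pod::Simple
recent enough to flag the missing declaration; the sample POD text is
made up.]

```perl
use strict;
use warnings;
use Pod::Simple::Text;

# A document containing a raw Latin-1 byte (0xE9) but no =encoding line.
my $pod = "=head1 NAME\n\nbad - caf\xE9 but no encoding declaration\n\n=cut\n";

my $parser = Pod::Simple::Text->new;
my $text = '';
$parser->output_string(\$text);   # collect the formatted text in $text
$parser->complain_stderr(0);      # report errata in the document itself
$parser->parse_string_document($pod);

# any_errata_seen() is true when the parser whined about anything,
# e.g. a non-ASCII byte seen before any =encoding line.
print $parser->any_errata_seen ? "errata seen\n" : "clean\n";
```

A Test::Pod-based test suite would then fail (or at least report) on
such a document, prompting the author to add the missing declaration.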