Re: Non-ASCII data in POD

2012-05-03 Thread Johan Vromans
Grant McLean gr...@mclean.net.nz writes:

 OK, so I went ahead and implemented both the warning and the heuristic
 to guess Latin-1 vs UTF-8 (only when no encoding was specified).  The
 resulting patch is here:

   https://github.com/theory/pod-simple/pull/26

This patch forces authors to add an =encoding UTF-8 line to
specify that the doc is, indeed, UTF-8 encoded.

Wouldn't it be far better to consider all POD documents to be UTF-8
encoded Unicode and fall back to Latin-1 if invalid UTF-8 sequences are
detected? In other words, do not require the author to add =encoding
UTF-8, since that would be the default, and only add =encoding ISO8859-1
for Latin-1 encoded documents?
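
Roughly what I have in mind, as a sketch using Encode (the sub name is
mine, and this is not code from the patch):

  use Encode qw(decode FB_CROAK);

  # Try a strict UTF-8 decode of the raw POD bytes first; if any byte
  # sequence is invalid, fall back to Latin-1, which accepts every
  # possible byte value.
  sub decode_pod_octets {                 # illustrative name only
      my ($octets) = @_;
      my $copy = $octets;                 # decode() may clobber its argument
      my $text = eval { decode('UTF-8', $copy, FB_CROAK) };
      return defined $text
          ? ($text, 'UTF-8')
          : (decode('ISO-8859-1', $octets), 'ISO-8859-1');
  }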

Since most POD documents currently are ASCII, they won't be affected.

POD docs that are Latin-1 or something similar must get an explicit
encoding line added. These are precisely the documents affected by your
patch.

-- Johan


Re: Non-ASCII data in POD

2012-05-03 Thread Grant McLean
On Mon, 2012-04-30 at 14:24 +0200, Johan Vromans wrote:
 Grant McLean gr...@mclean.net.nz writes:
 
  OK, so I went ahead and implemented both the warning and the heuristic
  to guess Latin-1 vs UTF-8 (only when no encoding was specified).  The
  resulting patch is here:
 
https://github.com/theory/pod-simple/pull/26
 
 This patch forces authors to add an =encoding UTF-8 line to
 specify that the doc is, indeed, UTF-8 encoded.

Not exactly.  It generates a warning during parsing, which will be
visible in the output of any formatter that has error output enabled.
It's not a fatal error, so it doesn't exactly enforce anything.

The aim is to help people comply with the spec for POD as it is
currently written.  And that spec says that if there are non-ASCII
characters there must be an =encoding declaration.

 Wouldn't it be far better to consider all POD documents to be UTF-8
 encoded Unicode and fall back to Latin-1 if invalid UTF-8 sequences are
 detected?

You won't get any argument from me that UTF-8 would be a better default,
but that's not how the spec is currently written.

If your Perl source code includes UTF-8 characters, you must say:

  use utf8;

If your POD includes UTF-8 characters, you must say:

  =encoding utf8
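
For example (module name and text made up), a spec-compliant document
that uses non-ASCII characters would begin:

  =encoding utf8

  =head1 NAME

  Foo::Bar - a naïve café locator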

 In other words, do not require the author to add =encoding
 UTF-8, since that would be the default, and only add =encoding
 ISO8859-1 for Latin-1 encoded documents?

The patch also implements the heuristic recommended in perlpodspec,
which has the effect of allowing either Latin-1 or UTF-8 to work (the
default is ASCII) in spite of the missing declaration.  This will be a
win for sites like metacpan.org, which currently don't display UTF-8
correctly from POD that lacks an =encoding declaration.

Any formatter that has error display disabled will see better rendering
of UTF-8 with this patch.

Additionally, if errors are displayed, the non-compliance with
perlpodspec will be reported.

Regards
Grant




Re: Non-ASCII data in POD

2012-04-27 Thread David E. Wheeler
On Apr 27, 2012, at 12:10 AM, Grant McLean wrote:

 OK, so I went ahead and implemented both the warning and the heuristic
 to guess Latin-1 vs UTF-8 (only when no encoding was specified).  The
 resulting patch is here:
 
  https://github.com/theory/pod-simple/pull/26

I like this, but shouldn't it be consistent? That is, if you see more
than one of these in a single document, and one can be output as UTF-8
and the other can’t, would the resulting output have mixed encodings?
IOW, shouldn't it use the encoding it determined for the first one it
finds in the document?

Best,

David



Re: Non-ASCII data in POD

2012-04-27 Thread Grant McLean
On Fri, 2012-04-27 at 09:17 -0700, David E. Wheeler wrote:
 On Apr 27, 2012, at 12:10 AM, Grant McLean wrote:
 
  OK, so I went ahead and implemented both the warning and the heuristic
  to guess Latin-1 vs UTF-8 (only when no encoding was specified).  The
  resulting patch is here:
  
   https://github.com/theory/pod-simple/pull/26
 
 I like this, but shouldn't it be consistent? That is, if you see more
 than one of these in a single document, and one can be output as UTF-8
 and the other can’t, would the resulting output have mixed encodings?
 IOW, shouldn't it use the encoding it determined for the first one it
 finds in the document?

I'm not sure I quite understand what you're saying.  The first time a
non-ASCII byte is encountered, the code will 'fire' and apply the
heuristic to set an encoding.  Once the encoding is set, the code won't
be called again.

The perlpodspec seems pretty clear that a POD document containing
different encodings should be considered an error.

Regards
Grant



Re: Non-ASCII data in POD

2012-04-27 Thread David E. Wheeler
On Apr 27, 2012, at 12:54 PM, Grant McLean wrote:

 I'm not sure I quite understand what you're saying.  The first time a
 non-ASCII byte is encountered, the code will 'fire' and apply the
 heuristic to set an encoding.  Once the encoding is set, the code won't
 be called again.

Oh, perfect. I missed that.

 The perlpodspec seems pretty clear that a POD document containing
 different encodings should be considered an error.

As it should be.

David



Re: Non-ASCII data in POD

2012-04-26 Thread Karl Williamson

On 04/25/2012 09:25 PM, Russ Allbery wrote:

Grant McLean gr...@mclean.net.nz writes:


My thoughts on the second issue are that we could modify Pod::Simple to
'whine' if it sees non-ASCII bytes but no =encoding.  This in turn would
cause Test::Pod to pick up the error and help people fix it.


I would be in favor of that.



FYI, this test is already in the checks that are run on the PODs that
are included with the Perl core.


Non-ASCII data in POD

2012-04-25 Thread Grant McLean
Hi POD people

There's been a discussion on #metacpan about non-ASCII characters in POD
being rendered incorrectly on the metacpan.org web site.

The short story is that some people use UTF-8 characters without
including an =encoding utf8 line.  Apparently the metacpan toolchain
assumes Latin-1 encoding, but with the right encoding declaration the
characters would be rendered correctly.

The latest perlpodspec seems to imply an ASCII default, and anything
else should have an =encoding declaration.  In the implementation notes
section it also suggests a heuristic: check whether the first high-bit
byte sequence is valid as UTF-8, and default to UTF-8 if so and Latin-1
otherwise.
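
In rough Perl terms that heuristic is something like this (my own
sketch, not code taken from the spec or from Pod::Simple):

  use Encode qw(decode FB_CROAK);

  # Find the first run of high-bit (non-ASCII) bytes in the raw POD.
  # If that run decodes cleanly as UTF-8, assume UTF-8; otherwise
  # assume Latin-1.  With no high-bit bytes at all, plain ASCII applies.
  sub guess_pod_encoding {                # illustrative name only
      my ($octets) = @_;
      my ($run) = $octets =~ /([\x80-\xFF]+)/
          or return 'ASCII';
      return eval { decode('UTF-8', $run, FB_CROAK); 1 } ? 'UTF-8'
                                                         : 'ISO-8859-1';
  }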

This raises two issues:

1) Pod::Simple (as used by metacpan) does not seem to implement this
   heuristic
2) We need to educate people who are not aware of the =encoding command

My thoughts on the second issue are that we could modify Pod::Simple to
'whine' if it sees non-ASCII bytes but no =encoding.  This in turn would
cause Test::Pod to pick up the error and help people fix it.
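
As a standalone illustration of the check itself (not the Pod::Simple
internals, and the sub name is made up):

  # Warn if a POD file contains non-ASCII bytes before any =encoding line.
  sub check_pod_encoding {
      my ($file) = @_;
      open my $fh, '<:raw', $file or die "Can't read $file: $!";
      my $has_encoding = 0;
      while (my $line = <$fh>) {
          $has_encoding = 1 if $line =~ /^=encoding\s+\S+/;
          if (!$has_encoding && $line =~ /[^\x00-\x7F]/) {
              warn "$file line $.: non-ASCII data but no =encoding declaration\n";
              return 0;
          }
      }
      return 1;
  }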

I'd be happy to look at implementing both these things if it's agreed
they're a good idea.

Regards
Grant