The Encoding Warning (Again)

2012-08-27 Thread David E. Wheeler
Pod People,

In https://rt.cpan.org/Ticket/Display.html?id=79232, Slaven Rezic writes:

 Pod::Simple currently (e.g. with version 3.23) complains if a Pod
 document has latin-1 characters in it but no =encoding command
 specified. I think this is incorrect, both perlpod.pod and
 perlpodspec.pod specify that a document without =encoding command is in
 latin-1:
 
 perlpod.pod (as of commit 684c7e375f581ccd114b4c6b4e8ea730402b50f3 in
 perl):
 
 =encoding encoding
 
 ... Most users won't need this; but if your encoding isn't US-ASCII
 or Latin-1 ...
 
 perlpodspec.pod (as of commit c85e9b4c9684bc896847f5a80e9e91b478c2fc59
 in perl):
 
 ... Otherwise, the character encoding should be understood as being
 UTF-8 if the first highbit byte sequence in the file seems valid as
 a UTF-8 sequence, or otherwise as Latin-1.
 
 (Unfortunately perlpodspec.pod isn't quite clear about this when
 explaining encoding; the quoted paragraph is from the Notes on
 Implementing Pod Processors section.)

None of this says anything about warnings, and it does seem that, as of 3.23, 
this is exactly how Pod::Simple behaves (or will once 
https://github.com/theory/pod-simple/pull/40 is merged).
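
For readers who want the perlpodspec rule spelled out, here is a minimal sketch in Python. It is not Pod::Simple's actual code; the function name guess_pod_encoding is made up for illustration, and it assumes the document has no =encoding command. It looks only at the first run of high-bit bytes, as the quoted paragraph describes.

```python
import re

# The first run of bytes with the high bit set (0x80-0xFF).
_HIGHBIT_RUN = re.compile(rb"[\x80-\xff]+")


def guess_pod_encoding(data: bytes) -> str:
    """Hypothetical helper: guess the encoding of Pod that has no =encoding."""
    match = _HIGHBIT_RUN.search(data)
    if match is None:
        return "ascii"  # no high-bit bytes, so there is nothing to guess
    try:
        # A valid UTF-8 multi-byte sequence consists only of high-bit bytes,
        # so the first run must decode cleanly if the file is UTF-8.
        match.group().decode("utf-8")
    except UnicodeDecodeError:
        return "latin-1"
    return "utf-8"


print(guess_pod_encoding("café".encode("utf-8")))
print(guess_pod_encoding("café".encode("latin-1")))
```

Note that a Latin-1 file whose first high-bit run happens to be valid UTF-8 would be misdetected; that ambiguity is inherent in the heuristic, not in this sketch.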

Thoughts?

David



Re: The Encoding Warning (Again) - some data

2012-08-27 Thread Grant McLean
On Mon, 2012-08-27 at 10:17 -0700, David E. Wheeler wrote:
 Pod People,
 
 In https://rt.cpan.org/Ticket/Display.html?id=79232, Slaven Rezic writes:
 
  Pod::Simple currently (e.g. with version 3.23) complains if a Pod
  document has latin-1 characters in it but no =encoding command
  specified. I think this is incorrect, both perlpod.pod and
  perlpodspec.pod specify that a document without =encoding command is
  in latin-1:

When I kicked this process off in April the issue I was trying to fix
was that UTF-8 documents did not render correctly on metacpan.org.  I
proposed two changes: implementing the encoding heuristic and adding the
warning.  There was a small amount of discussion and both proposals were
considered sane.

  http://www.nntp.perl.org/group/perl.pod-people/2012/04/msg1789.html

At the time I had no data on how many distributions were affected (I
only knew I saw mangled characters quite frequently).  Now that the
patch is in and generating the warning, I am able to get that data.  So
today I rendered all the POD from all current distributions in my
minicpan and collected stats on how often the warning was generated.

From a total of 5157 distributions, files in 1215 distributions generated
the warning (i.e.: contained non-ASCII characters in POD with no
=encoding declaration).
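
The condition being counted can be approximated as follows. This is a simplified sketch in Python, not the check Pod::Simple performs: the name would_warn is hypothetical, and it scans the whole file rather than only the Pod blocks, so it would over-count files whose non-ASCII bytes sit in code.

```python
import re

# An =encoding command at the start of a line, e.g. "=encoding utf8".
_ENCODING_CMD = re.compile(rb"^=encoding\s+\S+", re.MULTILINE)


def would_warn(data: bytes) -> bool:
    """Hypothetical check: non-ASCII bytes present and no =encoding command."""
    has_non_ascii = any(b > 0x7F for b in data)
    has_encoding = _ENCODING_CMD.search(data) is not None
    return has_non_ascii and not has_encoding


undeclared = "=head1 NAME\n\ncafé\n".encode("latin-1")
declared = "=encoding latin1\n\n=head1 NAME\n\ncafé\n".encode("latin-1")
print(would_warn(undeclared), would_warn(declared))
```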

The split was roughly 50-50 with 1187 files being detected as Latin-1
and 1131 as UTF-8.

So there are currently 1131 files which are now able to render correctly
on metacpan.org, which was my goal.

There are also 1187 files (from 570 distributions) which rendered
perfectly fine before but will now include the new warning in places
where rendering of parser errors is enabled.  This is approximately 11%
of current releases on CPAN - probably a higher number than I would have
anticipated.

Some portion of that 11% will be using Test::Pod and will now have a
test failure where none existed before.  (Sorry I don't have the
statistics on what proportion use Test::Pod and don't limit it to
'author' tests).

So if anyone's opinion is likely to be swayed by data - there's some
data.


My opinion is that the warning is useful.  However, to be pragmatic, now
that the encoding detection heuristic has been implemented, adding
=encoding declarations to any of those 1215 distributions will have no
practical effect other than silencing the warning.  That's the best
argument I can come up with in favour of changing the status quo.

If we decided to turn off that warning then I would like to see an
option to allow people to turn it back on if they want.


On a tangentially related note, I was pondering whether the heuristic
should actually fall back to CP1252 rather than ISO-8859-1 - after all
that's what the W3C recommend:

  http://www.w3.org/TR/html5/parsing.html#character-encodings-0

However my statistics show that only 44 files in current releases were
detected as Latin-1 but actually contained CP1252 (typically smart
quote symbols in the \x80-\x9F range).  So it doesn't seem worth
pursuing that change.
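
A rough way to spot such files is to look for bytes in the 0x80-0x9F range, which are control characters in ISO-8859-1 but printable characters in CP1252. The sketch below is in Python and the name looks_like_cp1252 is hypothetical; it only makes sense for files the heuristic has already classed as Latin-1, since valid UTF-8 uses those byte values too.

```python
def looks_like_cp1252(data: bytes) -> bool:
    """Hypothetical check: any byte in the 0x80-0x9F range?"""
    return any(0x80 <= b <= 0x9F for b in data)


smart_quotes = "\u201cquoted\u201d".encode("cp1252")  # bytes 0x93 ... 0x94
plain_latin1 = "café".encode("latin-1")               # byte 0xE9 only

print(looks_like_cp1252(smart_quotes), looks_like_cp1252(plain_latin1))
```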

Finally I searched for files which were detected as UTF-8 but actually
contained characters from the CP1252 range.  There was only one, and it
wasn't an error in the detection: the source file contains a
double-encoded character. It was a mangled attempt to name a contributor
- Slaven Resić   :-)
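
For anyone unfamiliar with double encoding, here is a small Python sketch of how it arises and how it can be undone. The function name undo_double_encoding and the choice of the character ć are illustrative assumptions; the file in question is not reproduced here.

```python
def undo_double_encoding(text: str) -> str:
    """Hypothetical repair: reverse one round of UTF-8-read-as-Latin-1."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not double-encoded (or not repairable this way)


original = "ć"
# UTF-8 bytes misread as Latin-1 and then encoded to UTF-8 a second time.
mangled_bytes = original.encode("utf-8").decode("latin-1").encode("utf-8")
mangled = mangled_bytes.decode("utf-8")  # still valid UTF-8, but wrong text

# The mangled text contains a code point in the U+0080-U+009F range,
# which is what makes it show up in a search like the one described.
print(any("\x80" <= ch <= "\x9f" for ch in mangled))
print(undo_double_encoding(mangled) == original)
```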

Regards
Grant