Hi all!
I'm still on this pod-people list, but I very rarely read it-- and usually I just glance at the subject lines. But today I looked, saw i18n stuff going on, and gave things a quick skim. Well,...


First-and-foremost: perlpodspec is no longer mine.

But I can weigh in on my original intent in bits and pieces of it, for whatever that is worth. My intent was: For files with no encoding declarations, when the parser hits 8-bit stuff, it should politely DWIM when it's sensible and possible.

And if I were writing the spec over again, I would spell that out: if that heuristic is applied and it is unproblematic (i.e., not contradicted later on, nor contradicting a BOM), then keep quiet.


I know we should hold CPAN module authors to a high standard of dotting their i's and crossing their t's ...and declaring their encodings.

But I think we should also hold pod parsers to a high standard of keeping quiet when simple heuristics are unproblematically applied.

(Pod parsers already do some silent heuristics,... well, I put the "heuristics" into the spec itself, just to make sure that they would get universally and silently implemented. Example: numeric item points, "=item 3", are actually anything that matches
  m/\A=item\s+\d+\.?\s*\z/
...That's a moderately forgiving regexp.)
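
To make "moderately forgiving" concrete, here is a small sketch that runs that regexp over a few sample lines. The helper name is_numeric_item and the sample lines are made up for illustration; only the pattern itself comes from the spec.

```perl
use strict;
use warnings;

# The pattern quoted above, compiled once.
my $numeric_item = qr/\A=item\s+\d+\.?\s*\z/;

# Hypothetical helper: true if the line is a "numeric item point".
sub is_numeric_item {
    my ($line) = @_;
    return $line =~ $numeric_item ? 1 : 0;
}

# A trailing dot and stray whitespace are tolerated; anything after the
# digits other than an optional dot and whitespace is not.
for my $line ("=item 3", "=item 3.", "=item  42. ", "=item 3rd", "=item *") {
    printf "%-14s %s\n", "'$line'",
        is_numeric_item($line) ? "numeric" : "not numeric";
}
```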

And now I consider that perl itself is *thick* with silent heuristics (such as: the pattern /goo$/ has nothing to do with the variable $/ ...to say nothing of what z/8/g can do, depending on prototypes). And considering those things, I would change a "should" to a "must", and I would say:

A pod parser must obey that encoding-guessing heuristic, and it must do so silently--

* Unless the result of the heuristic is contradicted by a different kind of byte sequence later, or by a contradictory "=encoding" line. (In that case, yes, complain!)

* Or unless the parser is asked to emit a warning (otherwise: stay quiet).
And the warning should be asked for probably just in the situation of the code being parsed as part of some "critique my code in EVERY way possible!" program-- not in the context of a user simply trying to install a module. (Did I say *a* module? Worse situation: when the user is simply trying to install one module, then the installer reports that that module has 5 dependencies, hey let's install them all, but dependency #3 is the one causing the discomfortingly baffling "Highbit but no =encoding!?" warning to fly by.)
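
Here is a rough sketch of the kind of thing I mean, not a real parser's code. The function name guess_encoding is hypothetical, the UTF-8 check is deliberately simplified (it does not reject overlong forms or surrogates), and the fallback is CP1252 rather than the spec's Latin-1, per the suggestion further down. It only checks for contradictions in the UTF-8 direction, because a later UTF-8-looking sequence in CP1252 text is also perfectly valid CP1252.

```perl
use strict;
use warnings;

# Byte-level pattern for one well-formed multi-byte UTF-8 sequence.
# Simplified on purpose: no overlong or surrogate checks.
my $utf8_seq = qr/ [\xC2-\xDF][\x80-\xBF]
                 | [\xE0-\xEF][\x80-\xBF]{2}
                 | [\xF0-\xF4][\x80-\xBF]{3} /x;

# Hypothetical helper. Takes the raw bytes of a file that has no
# "=encoding" line; returns the guess followed by any complaints.
sub guess_encoding {
    my ($bytes) = @_;
    return ('ASCII') unless $bytes =~ /[\x80-\xFF]/;

    # A BOM, or a first high-bit sequence that looks like UTF-8, means UTF-8.
    my $has_bom = $bytes =~ /\A\xEF\xBB\xBF/;
    my $guess = ($has_bom || $bytes =~ /\A[\x00-\x7F]*$utf8_seq/)
        ? 'UTF-8' : 'CP1252';

    my @complaints;
    if ($guess eq 'UTF-8') {
        # Strip every valid sequence; any high-bit byte left over
        # contradicts the guess (or the BOM).
        (my $rest = $bytes) =~ s/$utf8_seq//g;
        push @complaints, "guessed UTF-8, but later bytes are not valid UTF-8"
            if $rest =~ /[\x80-\xFF]/;
    }
    return ($guess, @complaints);
}

my ($enc, @complaints) = guess_encoding("Garc\xC3\xADa");
print "$enc\n";               # the caller decides whether to warn at all
warn "$_\n" for @complaints;  # empty here, so nothing is printed
```

The point of returning the complaints instead of warning inside the function is that the unproblematic case stays silent by construction, and whether the guess itself is worth mentioning is left to whoever asked.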

Users' and authors' attention is better spent elsewhere than chasing why a previously all-OK module now throws warnings because Garcia changed into García. (And that difference might be invisible in our fun scrunched up programmerese fonts!)


But the above was just my intent. Since then, there's been a decade for people to find trouble with possible "utf8 or latin1?" heuristics, or to see complications that need dodging,...
So it's up to you all, and I won't debate anyone's conclusions.


Oh, and I hope I can help here:

On 08/27/2012 10:03 PM, Grant McLean wrote:
> On a tangentially related note, I was pondering whether the heuristic
> should actually fall back to CP1252 rather than ISO8859-1 - after all
> that's what the W3C recommend:
>
>    http://www.w3.org/TR/html5/parsing.html#character-encodings-0

I say go for it.  Because: CP1252 simply hadn't occurred to me.
If it had, I would have declared CP1252 the fallback, instead of Latin-1.
(I would have done this especially if I had seen that W3C recommendation! They do not do such things lightly.)

So wherever you see Latin-1, in the spec or in this message, I retcon-wise intended CP1252!

Ya know, also: even if we're in Latin-1 mode from an explicit "=encoding latin-1", maybe bytes in 0x80-0x9F should be forgiven as being CP1252. It would be something that I, just as a *user*, would appreciate. I would never turn on true Latin-1 for the purpose of constraining myself from using the characters that CP1252 has in 0x80-0x9F. But you guys figure it out.
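
For anyone who wants to see the difference, here is a small sketch using the core Encode module. The sample bytes and the codepoints helper are made up for illustration. In true Latin-1 the bytes 0x80-0x9F decode to C1 control characters, which nobody types on purpose; in CP1252 the ones used here are curly quotes and a dash.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Sample bytes: curly quotes (0x93, 0x94) and an en dash (0x96) as
# CP1252 has them, plus e-acute (0xE9), which both encodings agree on.
my $bytes = "\x93quoted\x94 \x96 caf\xE9";

my $as_latin1 = decode('iso-8859-1', $bytes);
my $as_cp1252 = decode('cp1252',     $bytes);

# Hypothetical helper: show a string as a list of code points.
sub codepoints {
    return join ' ', map { sprintf 'U+%04X', ord } split //, $_[0];
}

print "latin-1: ", codepoints(substr($as_latin1, 0, 1)), "\n";  # U+0093, a C1 control
print "cp1252:  ", codepoints(substr($as_cp1252, 0, 1)), "\n";  # U+201C, a left curly quote
```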


And incidentally, it's up to you all to figure out whether \d in the =item thing should be changed to [0-9], now that \d can be all Unicodey and can be digits like ३٢๔༩၇ etc.,... Or I dunno, maybe someone writing pod Thai documentation does want "=item ๖", in which case the current regexp is... correct? I guess?
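
A quick sketch of the two behaviors, so the choice is concrete. The sample line uses U+0E56 (THAI DIGIT SIX); the variable names are just for illustration, and which answer is "right" is exactly the open question.

```perl
use strict;
use warnings;

my $thai_item = "=item \x{0E56}";   # "=item" followed by THAI DIGIT SIX

# Current regexp: \d matches any Unicode decimal digit in this string.
my $with_d     = $thai_item =~ /\A=item\s+\d+\.?\s*\z/    ? 1 : 0;

# ASCII-only variant: [0-9] does not match the Thai digit.
my $with_ascii = $thai_item =~ /\A=item\s+[0-9]+\.?\s*\z/ ? 1 : 0;

print "\\d:    $with_d\n";       # 1
print "[0-9]: $with_ascii\n";    # 0
```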


BTW: "pod Thai" GET IT?  PAD THAI.. Like the NOODLES! HAAAAAAAH!


Ten years I have waited for that.  Ten.  Ten.
