Re: The Encoding Warning (Again) - some data

Sean M. Burke Tue, 28 Aug 2012 21:16:41 -0700

Hi all!

I'm still on this pod-people list, but I very rarely read it-- and usually just a glance at the subject lines. But today I looked, saw i18n stuff going on, and I gave things a quick skim. Well,...



First-and-foremost: perlpodspec is no longer mine.

But I can weigh in on my original intent in bits and pieces of it, for whatever that is worth. My intent was:

For files with no encoding declarations, when the parser hits 8-bit stuff, it should politely DWIM when it's sensible and possible.

And if I were writing the spec over again, I would spell that out: if that heuristic is applied and it is unproblematic (i.e., not contradicted later on, nor contradicting a BOM), then keep quiet.

I know we should hold CPAN module authors to a high standard of dotting their i's and crossing their t's ...and declaring their encodings.

But I think we should also hold pod parsers to a high standard of keeping quiet when simple heuristics are unproblematically applied.

(Pod parsers already do some silent heuristics,... well, I put the "heuristics" into the spec itself, just to make sure that they would get universally and silently implemented. Example: numeric item points, "=item 3", are actually anything that matches

  m/\A=item\s+\d+\.?\s*\z/
...That's a moderately forgiving regexp.)

And now I consider that perl itself is *thick* with silent heuristics (such as: the pattern /goo$/ has nothing to do with the variable $/... to say nothing of what z/8/g can do, depending on prototypes).

And considering those things, I would change a "should" to a "must", and I would say:

A pod parser must obey that encoding-guessing heuristic, and it must do so silently--

* Unless the result of the heuristic is contradicted by a different kind of byte sequence later, or by a contradictory "=encoding" line. (In that case, yes, complain!)


* Or unless the parser is asked to emit a warning (otherwise: stay quiet).
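
As an illustration of the rule above, here is a minimal sketch in Python (not Perl, and not how any existing pod parser does it). The function name `guess_pod_encoding`, the `want_warnings` flag, the `fallback` parameter, and the choice to judge the file one line at a time are all assumptions made for the example; BOM handling and "=encoding" lines are left out to keep it short.

```python
def guess_pod_encoding(raw: bytes, want_warnings: bool = False,
                       fallback: str = "latin-1"):
    """Guess the encoding of undeclared pod source, quietly by default.

    Returns (encoding_name, complaints). The complaints list stays empty
    unless the guess is contradicted later in the file, or the caller
    explicitly asked for warnings.
    """
    complaints = []
    guess = None
    for line in raw.split(b"\n"):
        if line.isascii():          # pure ASCII says nothing either way
            continue
        try:
            line.decode("utf-8")    # does the high-bit content parse as UTF-8?
            kind = "utf-8"
        except UnicodeDecodeError:
            kind = fallback
        if guess is None:
            guess = kind            # the first high-bit line decides the guess
        elif kind != guess:
            # A different kind of byte sequence later on: yes, complain.
            complaints.append(
                f"guessed {guess}, but a later line looks like {kind}")
            break
    if guess is None:
        return "ascii", complaints
    if want_warnings:
        # Only a caller that opted in hears about the missing declaration.
        complaints.append(
            f"high-bit bytes but no =encoding line; guessed {guess}")
    return guess, complaints
```

A module installer would call this with the default `want_warnings=False`, while a "critique everything" tool would pass `want_warnings=True`.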

And the warning should be asked for probably just in the situation of the code being parsed as part of some "critique my code in EVERY way possible!" program-- not in the context of a user simply trying to install a module.

(Did I say *a* module? Worse situation: when the user is simply trying to install one module, then the installer reports that that module has 5 dependencies, hey let's install them all, but dependency #3 is the one causing the discomfortingly baffling "Highbit but no =encoding!?" warning to fly by.)

Users' and authors' attention is better spent elsewhere than chasing why a previously all-OK module now throws warnings because Garcia changed into García. (And that difference might be invisible in our fun scrunched up programmerese fonts!)

But the above was just my intent. Since then, there's been a decade for people to find trouble with possible "utf8 or latin1?" heuristics, or to see complications that need dodging,...

So it's up to you all, and I won't debate anyone's conclusions.


Oh, and I hope I can help here:

On 08/27/2012 10:03 PM, Grant McLean wrote:

On a tangentially related note, I was pondering whether the heuristic
should actually fall back to CP1252 rather than ISO8859-1 - after all
that's what the W3C recommend:

   http://www.w3.org/TR/html5/parsing.html#character-encodings-0


I say go for it.  Because: CP1252 simply hadn't occurred to me.
If it had, I would have declared CP1252 the fallback, instead of Latin-1.

(I would have done this especially if I had seen that W3C recommendation! They do not do such things lightly.)

So wherever you see Latin-1, in the spec or in this message, I retcon-wise intended CP1252!

Ya know, also, maybe if we're in Latin-1 mode even from an explicit "=encoding latin-1", maybe 0x80-0x9F sequences should be forgiven as being in CP1252. It would be something that I, just as a *user*, would appreciate. I would never turn on true Latin-1 for the purposes of constraining me from using characters that CP1252 has in 0x80-0x9F. But you guys figure it out.
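
To make the 0x80-0x9F point concrete, here is a small Python sketch (Python rather than Perl, purely as an illustration). The function name `decode_latin1_forgiving` is hypothetical. It assumes that "forgiving" means reading each 0x80-0x9F byte as CP1252 where CP1252 defines it, and leaving it as the Latin-1 C1 control character otherwise; Python's `cp1252` codec leaves five bytes in that range undefined.

```python
def decode_latin1_forgiving(raw: bytes) -> str:
    """Decode as Latin-1, but read bytes 0x80-0x9F as CP1252 when possible."""
    out = []
    for b in raw:
        one = bytes([b])
        if 0x80 <= b <= 0x9F:
            try:
                # CP1252 puts printable characters here (curly quotes, the
                # euro sign, ...) where Latin-1 only has C1 control codes.
                out.append(one.decode("cp1252"))
                continue
            except UnicodeDecodeError:
                pass  # byte is undefined in Python's cp1252 codec
        # Everywhere else the two encodings agree, so Latin-1 is fine.
        out.append(one.decode("latin-1"))
    return "".join(out)
```

With strict Latin-1, the bytes 0x93 and 0x94 decode to invisible control characters; with the forgiving version they come out as the curly quotes the author almost certainly meant.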

And incidentally, it's up to you all to figure out whether \d in the =item thing should be changed to [0-9], now that \d can be all Unicodey and can be digits like ३٢๔༩၇ etc.,... Or I dunno, maybe someone writing pod Thai documentation does want "=item ๖", in which case the current regexp is... correct? I guess?
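
The difference between the two spellings can be seen in a quick Python sketch (Python rather than Perl, as an illustration only). In Python 3, `\d` on a text pattern matches any Unicode decimal digit, which is comparable to the Unicode-aware `\d` discussed above, and Python's `\Z` plays the role of Perl's `\z`. The names `UNICODE_ITEM` and `ASCII_ITEM` are made up for the example.

```python
import re

# The numbered-item pattern from the spec, with Unicode-aware \d.
UNICODE_ITEM = re.compile(r"\A=item\s+\d+\.?\s*\Z")

# The same pattern restricted to the ASCII digits 0-9.
ASCII_ITEM = re.compile(r"\A=item\s+[0-9]+\.?\s*\Z")

def is_numbered_item(line: str, ascii_only: bool = False) -> bool:
    """Report whether a pod line counts as a numbered =item."""
    pattern = ASCII_ITEM if ascii_only else UNICODE_ITEM
    return pattern.match(line) is not None
```

Both patterns accept "=item 3" and "=item 3."; only the `\d` version accepts "=item ๖", which is exactly the choice being left open here.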



BTW: "pod Thai" GET IT?  PAD THAI.. Like the NOODLES! HAAAAAAAH!


Ten years I have waited for that.  Ten.  Ten.
