Re: perlpodspec, draft 1

Sean M. Burke Mon, 20 Aug 2001 18:38:44 -0700
At 02:02 PM 2001-08-20 -0700, Tim Gim Yee wrote:
>[...]
>This is probably already implicit, but I will make it explicit.  The blank
>lines between verbatim paragraphs constitute significant whitespace, and a
>parser must pass those blank lines verbatim to the formatter.  In other
>words, if there are 4 blank lines between two verbatim paragraphs, the
>formatter should see and render 4 blank lines.  It would be an error to do
>otherwise.
>[...]

Hm, that complicates things a bit.  What I was thinking of was: If you're
building a parse tree, then after the tree is done, but before you return
it, you should walk the tree and concatenate adjacent verbatim nodes,
inserting just a blank line between their content.
And then I added something to the effect that it's okay if event-based
parsers can't pull that off.
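
A minimal sketch of that post-parse tree walk, in Python purely for illustration. The node shape (a dict with "type" and "text" keys) and the function name are assumptions, not anything perlpodspec defines:

```python
def merge_adjacent_verbatims(nodes):
    """Return a new list in which runs of adjacent verbatim nodes are joined.

    Each node is assumed to be a dict like {"type": "verbatim", "text": "..."};
    that shape is hypothetical, not part of perlpodspec.
    """
    merged = []
    for node in nodes:
        previous = merged[-1] if merged else None
        if (previous is not None
                and previous["type"] == "verbatim"
                and node["type"] == "verbatim"):
            # Join with exactly one blank line, however many the source had.
            previous["text"] = previous["text"] + "\n\n" + node["text"]
        else:
            merged.append(dict(node))  # copy, so the caller's nodes are untouched
    return merged
```

Note that this deliberately throws away the count of blank lines between the verbatims, which is exactly the point under dispute above.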

It was my impression that this required particularly unpleasant lookahead,
in event-based parsers.  I.e., when you see a blank line terminating a
paragraph, you can't just act on it --because what if it's that kind of
significant-blank-line that can be between verbatims?

I think it might make for hard-to-write code.  I don't want to require POD
parser implementers to write code that I myself wouldn't be willing to
write (not just barely able, but willing!).

I already wrote code to implement perlpodspec's idea of paragraph/section
boundaries, to convince myself it wasn't hard -- but I'll try coding up
your idea -- if /I/ see a simple way to do it, then I think other people
will be able to.

(And BTW, no one should think this is an invitation for anyone to mail me
code.  "See, I implemented it, so it's simple!!!"  No.)


>And although perlpod already illustrates it clearly, I will restate the
>obvious: "=begin X" is terminated by a matching "=end X".  A
>parser/formatter must not terminate "=begin X" with "=end Y", "=end Z", or
>any "=end" command without a matching format name.  A "=begin X" without a
>matching "=end X" is an error.

Yes, that sounds good.  I was going to make the language firmer (as I
should and will), but I kept vacillating on whether =begin...=end sequences
should nest, until (late in the process) I came up with arguments for
nesting, and none against nesting.
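
One way to picture the matching rule with nesting allowed is a simple stack. This is only a sketch in Python: the input format (a list of ("begin" or "end", format name) pairs) and the function name are assumptions made for illustration:

```python
def check_begin_end(commands):
    """Check a sequence of ("begin" | "end", format_name) pairs.

    Returns a list of error strings; an empty list means every =begin
    was closed by a matching =end, with nesting allowed.
    """
    errors = []
    open_formats = []  # stack of format names currently open
    for kind, name in commands:
        if kind == "begin":
            open_formats.append(name)
        elif kind == "end":
            if open_formats and open_formats[-1] == name:
                open_formats.pop()
            else:
                # A mismatched =end must NOT terminate the open =begin.
                expected = open_formats[-1] if open_formats else "nothing"
                errors.append(f"=end {name} does not match open =begin ({expected})")
    for name in open_formats:
        errors.append(f"=begin {name} has no matching =end {name}")
    return errors
```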


>> Some pod formatters output to formats that implement nonbreaking
>> spaces as an individual character (which I'll call "NBSP"), and
>> others output to formats that implement nonbreaking spaces just as
>> spaces wrapped in a "don't break this across lines" code.  Note that
>> at the level of pod, both sorts of codes can occur: pod can contain a
>> NBSP character (whether as a literal, or as a "EE<lt>160>" or
>> "EE<lt>nbsp>" sequence); and pod can contain "SE<lt>foo
>> IE<lt>barE<gt> baz>" sequences, where "mere spaces" (character 32) in
>> such sequences are taken to represent nonbreaking spaces.
>
>S<> confuses me.  Does a tab ("\t") map to a NBSP also?  [...]
>What about ("\n")? 

I should add examples.  My idea of this is that all literal whitespace will
have been collapsed (s/\s+/ /g) before ANY interior sequences are even
parsed, so that the only effect of S<...> is to mean "under this S node,
space (0x20) means NBSP."  Some of those 0x20s may have started out as
strings of literal space-tab-space-return-space-space-space.
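
A rough Python sketch of that two-step model, assuming the collapse happens on the raw paragraph text and the S<...> handling happens later on already-collapsed text. The function names are hypothetical:

```python
import re

NBSP = "\u00a0"

def collapse_whitespace(paragraph):
    """Collapse every run of literal whitespace to one space (like s/\\s+/ /g).

    re.ASCII keeps \\s from matching a literal NBSP character, which must
    survive as-is.
    """
    return re.sub(r"\s+", " ", paragraph, flags=re.ASCII)

def render_s_content(text):
    """Under an S<...> node, every plain space (0x20) stands for NBSP."""
    return text.replace(" ", NBSP)
```

So a tab or newline inside S<...> never reaches the S handling as a tab or newline; by then it is already a 0x20.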

>> Note that section names might contain markup.  I.e., if a section
>> starts with:
>>
>>   =head2 About the C<-M> Operator
>>
>> (or "=item About the CE<lt>-M> Operator"), then a link to it would
>> look like this:
>>
>>   L<somedoc/About the C<-M> Operator>
>
>Would L<somedoc/About the -M Operator> work?
>Would L<somedoc/About the B<-M> Operator> work?

I don't require a processor to recognize these as synonymous.
In practice, a processor /might/ just decide to do that -- i.e., that the
simplest way to deal with link-section-names with markup is to remove the
markup (not for any rendering, but for purposes of coming up with the
anchor name, i.e., the thing in the <a name="anchor name">foo</a>, or in
the <a href="foo#anchor%20name">bar</a>).

Example: L<over there|somedoc/About the C<-M> Operator> and
         L<over there|somedoc/About the -M Operator> and
         L<over there|somedoc/About B<C<B<< C<the> >>>> C<-M> Operator>
...MIGHT all produce, in HTML:
 <a href="some_docbase/somedoc.html#About%20the%20-M%20Operator">over there</a>
And I think that's harmless.
I don't require it, but it seems permissible.
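
A sketch of the "remove the markup, then encode" approach in Python. The tree shape here is an assumption (plain strings for text, and (code, children) tuples for formatting codes); only urllib.parse.quote is a real library call:

```python
from urllib.parse import quote

def plain_text(node):
    """Flatten a node to its text, discarding all formatting codes."""
    if isinstance(node, str):
        return node
    _code, children = node
    return "".join(plain_text(child) for child in children)

def anchor_for(section_tree):
    """Build a %-encoded anchor name from a parsed section name."""
    text = "".join(plain_text(node) for node in section_tree)
    return quote(text, safe="")

# "About B<C<B<< C<the> >>>> C<-M> Operator", parsed by hand:
nested = ["About ", ("B", [("C", [("B", [("C", ["the"])])])]),
          " ", ("C", ["-M"]), " Operator"]
print(anchor_for(nested))  # About%20the%20-M%20Operator
```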

If I thought it reasonably likely that someone would have a document with
two /different/ sections:

  =head1 About Foo

  =head1 About C<Foo>

THEN I would have to insist on keeping the markup (or rather, the structure
that it represents) when generating the anchor name.

(If I said you couldn't toss out markup in building the anchor name, then
you'd have to do something strange like take the tree fragment that "About
B<C<B<< C<the> >>>> C<-M> Operator" parses into, and serialize it, like so:
  "About B\caC\caB\caC\cathe\cb\cb\cb\cb C\ca-M\cb Operator"
That's before you &-encode it or %-encode it, of course.
That would faithfully preserve the distinction between "=head1 About Foo"
and "=head1 About C<Foo>" (and between links to them), and I thought of
mentioning this in the documentation, but I decided it would just frighten
people.  "Good God, I have to DO THAT?!"  No, you don't.)
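
For the curious, that structure-preserving serialization can be sketched in a few lines of Python. The tree shape (strings for text, (code, children) tuples for formatting codes) is an assumption; "\x01" and "\x02" are the control-A and control-B that "\ca" and "\cb" denote in Perl:

```python
OPEN, CLOSE = "\x01", "\x02"  # control-A, control-B

def serialize(node):
    """Serialize one node, keeping formatting codes as CODE ^A ... ^B."""
    if isinstance(node, str):
        return node
    code, children = node
    return code + OPEN + "".join(serialize(child) for child in children) + CLOSE

def serialize_section(section_tree):
    """Serialize a whole parsed section name."""
    return "".join(serialize(node) for node in section_tree)
```

Unlike the markup-stripping approach, this gives "About Foo" and "About C<Foo>" different anchor names.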

And recall that the typical rendering format is not hypertext, so most
people don't have to worry about this.  In fact, this is of interest almost
exclusively to people who are writing a pod2html.  And once I'm done with
this perlpod/perlpodspec thing, I'll see about fixing the standard-dist
pod2html so that it won't be so appalling that people feel that they need
to go writing their own.

>Given:
>
>    =head2 About the B<C<-M>> Operator
>
>Would the link look like none, any, some, or all of these?
>
>    L<somedoc/About the B<C<-M>> Operator>
>    L<somedoc/About the C<B<-M>> Operator>
>    L<somedoc/About the C<-M> Operator>
>    L<somedoc/About the -M Operator>

You take the content of the head2, escape any literal / or |'s, and you put
it in L<somedoc/there>.
So:
  L<somedoc/About the B<C<-M>> Operator>
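
As a small Python sketch of that recipe: take the heading's pod source, replace literal / and | with the E<sol> and E<verbar> escapes, and wrap it in L<...>. The function name and arguments are made up for illustration:

```python
def link_to_section(doc_name, heading_source):
    """Build an L<name/section> link from a heading's pod source text."""
    escaped = heading_source.replace("/", "E<sol>").replace("|", "E<verbar>")
    return f"L<{doc_name}/{escaped}>"

print(link_to_section("somedoc", "About the B<C<-M>> Operator"))
# L<somedoc/About the B<C<-M>> Operator>
```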

>How would I create links to these?
>
>    =item rot13($text)
>    =item rot13 [TEXT]
>    =item $obj->rot13($text)
>    =item $grfg = $obj->rot13($text)
>
>Would the following work?  For all of the above?
>
>    L<somedoc/rot13>
>    L<somedoc/rot13()>
>    L<somedoc/rot13($text)>
>    L<somedoc/rot13(TEXT)>

Do all of the latter list mean to link to all of the former?  No.

The processors are assumed to have little in the way of telepathy, and
should not be expected to turn L<somedoc/rot13($text)> into
<a href="somedoc/rot13">...</a>, nor
  =item $grfg = $obj->rot13($text)
into
  <dt><a name='rot13'>$grfg = $obj-&gt;rot13($text)</a>
               ^^^^^
but instead are assumed to do something like
  <dt><a name='$grfg = $obj-&gt;rot13($text)'>$grfg = $obj-&gt;rot13($text)</a>

Altho if a pod2html author decides that you can't get away with just
anything in an a name fragment (which I suspect is quite true -- notably
/I/ wouldn't try using characters over 126 regardless of whether they're
&-encoded / %-encoded), he may elect to drop/substitute any characters he
considers troublesome:
  <dt><a name='$grfg_=_$obj-_rot13($text)'>$grfg = $obj-&gt;rot13($text)</a>

Given L</The E<euro>1000 Prize!>, this might mean generating an anchor name
of "The__1000_Prize_" or "The_X1000_PrizeX" or "The_1000_Prize" or
"The1000Prize".  You don't want to drop too much tho, or all the $/ $[ $"
etc things in perlvar all get smashed into "__" or "XX".
(An alternative approach, instead of dropping or dumbly substituting, is
encoding -- say, using  "XcharnumY" to signify any problematic character.)
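
The three strategies (drop, substitute, encode) can be compared side by side in a short Python sketch. Which characters count as "troublesome" is the pod2html author's call; here it is assumed, arbitrarily, that only ASCII letters and digits are safe and that a space becomes "_":

```python
import string

SAFE = set(string.ascii_letters + string.digits)  # an arbitrary choice
MODES = ("drop", "substitute", "encode")

def anchor_name(text, mode="encode"):
    """Build an anchor name, handling troublesome characters per `mode`."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    out = []
    for ch in text:
        if ch in SAFE:
            out.append(ch)
        elif ch == " ":
            out.append("_")
        elif mode == "drop":
            continue                      # silently discard the character
        elif mode == "substitute":
            out.append("X")               # one dumb stand-in for everything
        else:
            out.append(f"X{ord(ch)}Y")    # "XcharnumY" encoding

    return "".join(out)

print(anchor_name("The \u20ac1000 Prize!", "drop"))        # The_1000_Prize
print(anchor_name("The \u20ac1000 Prize!", "substitute"))  # The_X1000_PrizeX
print(anchor_name("The \u20ac1000 Prize!", "encode"))      # The_X8364Y1000_PrizeX33Y
```

With "drop", every punctuation variable in perlvar collapses to the empty string, while "encode" keeps $/ and $[ distinct (X36YX47Y versus X36YX91Y). As written the encoding is not reversible, since X and Y are themselves safe letters.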



But my point here is not to write pod2html and call it a specification.
(I've already done that in my head, but I'm merciful enough to have kept it
there.)


>The whole reason I'd want L<scheme:...> is so I could do
>L<text|scheme:...>.  L<http://www.perl.org/> is really not much better
>than http://www.perl.org/ sans L<>, given that pod2xxx should mark it
>up for me anyways.

Your pod2xxx may.  But it might not.  L<url> is meant to make it explicit
and required, not leaving it to individual processors.
/Requiring/ processors to turn a bare url into a hyperlink, or even a foo()
into a C<foo()>, really bothers me -- I'm not even sure I like formatters
doing those things at all, but I'm being beneficent and not forbidding it.


Lapsing into opinion here:
Autojujufication like that -- with the URLification and the implicit C<...>
and the quotes turning into 66s and 99s -- is a whole class of things that
can never be done as reliably as I want; and when they're done, they're
usually just inconsistent enough to really annoy me.  And if you write
rules complex enough to get this magic right /most/ of the time, they're
too complex to remember, so I can never rely on them anyway -- AND THEN you
have to have some nonapparent way to KEEP the rules from applying, like,
say, getting Z<>$3 to suppress $3 magically turning into C<$3>.
Also: deciding when, in the processing model, the various kinds of
autojujufication should apply, is bothersome and possibly fraught with
minor paradoxes.

If people want to try putting some of these things in their particular
processor, they're welcome to try, but 0) not on by default!, 1) I really
think it's not worth the bother, 2) remember that the magic heuristics
/will/ fail and be inconsistent, annoyingly so.


>What sort of parsing and rendering problems does this restriction
>avoid?  I'm clueless as to any parsing problems.  I'm guessing the
>rendering problems relate to non-hypertext formats.  [...]

Don't get me started on that, really.  The actual code to implement just
correct /parsing/ of L<...> is already horrendous, and adding L<text|url>
would make it TOO CRAZY.
I'm drawing a line in the sand (with my iron fist! or, failing that, my
iron fish), and saying THIS CRAZY, AND NO CRAZIER!

Moreover, yes, the L<text|url> /rendering/ problems are in non-hypertext
formats.  You outlined all the rendering possibilities that I thought of,
and I really really don't like any of them, so that's the
semantic/rendering reason why I say no L<text|url>.


>BTW, how would pod2text and pod2man render the following?
>
>    Set the L<input record separator|perlvar/$E<sol>> to ""...
>
>Do they just scrap the link info?
>
>    Set the "input record separator" to ""...

Yes, they do, and I approve of that.

(So Authors are on their honor to not do things like
  L<input record separator|Something::Irrelevent/Really>
)

>And could I have written it like so?
>
>    Set the L<input record separator|perlvar/"$/"> to ""...
>
>Or must the "/" always be escaped?

perlpod rewrite draft 1 says:

                (Text, name, and section cannot contain the characters
                '/' and '|', and any '<' or '>' should be matched.
                Moreover, name should not contain spaces.)

Whether you could get away with L<input record separator|perlvar/"$/"> with
a particular parser is another story.  With the one in my head, I'm pretty
sure it would work.  However, if you obey that perlpod rule, and use E<sol>
instead of the /, then everyone is happy and no-one gets hurt.  Nothing to
see here, folks, move along.

(I tried rewriting that rule so that it would explain only what things
/need/ escaping, and it was a maze.  Life is simpler with the nice simple
above rule.)
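
The simple rule is also simple to check mechanically. A Python sketch, assuming the three parts of an L<text|name/section> have already been split out by the parser and are still in their escaped pod-source form (function names are hypothetical):

```python
def angle_brackets_matched(s):
    """True if every '<' has a later '>' and no '>' comes first."""
    depth = 0
    for ch in s:
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

def link_part_problems(text, name, section):
    """Return a list of ways the parts break the perlpod draft rule."""
    problems = []
    for label, value in (("text", text), ("name", name), ("section", section)):
        if "/" in value or "|" in value:
            problems.append(f"{label} contains '/' or '|'")
        if not angle_brackets_matched(value):
            problems.append(f"{label} has unmatched '<' or '>'")
    if " " in name:
        problems.append("name contains a space")
    return problems
```

Under this check, a section written as $E<sol> passes and one written as "$/" is flagged, whatever a lenient parser might let through.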


So, is this more than you ever wanted to know about POD?


--
Sean M. Burke  [EMAIL PROTECTED]  http://www.spinn.net/~sburke/