Re: Pod as shorthand for XML

Adam Turoff Fri, 04 Jan 2002 08:11:18 -0800

Sorry this is so long.  This idea comes up every so often, and I don't
remember the last time it was possible to lay all the issues on the table
in a single message.

On Thu, Jan 03, 2002 at 02:03:48PM -0700, Sean M. Burke wrote:
> So I've been thinking many Deep Thoughts lately about Pod.
> 
> I have competing goals in the design of Pod as a document format: 

OK.

> The first and foremost goal is the absolute requirement that Pod be
> sufficient for easily writing text documentation, and that its semantics be
> simple enough for all its constructs to be easily translatable into any
> sane markup language or typesetting system. 

OK.  

> The second goal is that Pod be extensible enough that you could use it as a
> sort of "Huffman-coding for XML" [...]

Er, um, uh, why?  And which definition of XML are you using?  The simple
definition of a well-formed tagged document where block and inline tags
are conflated, or the whole big tur^Wshiny ball of metal that includes
schemata, hyperlinking, namespaces and whatnot?

It would be really nice if the second didn't induce carpal tunnel
syndrome.

It would be sufficient if the first was explicitly targeted as a goal
for an extension of Pod.

It would be an interesting hack to target the second goal, but one that
would probably never be adopted outside a small group of extremists (see
http://www.yaml.org, sml-dev, etc.).

Getting back to the first goal, remember that Pod is a *formatting*
language, and XML is simply a grammar.  Yes, a grammar, not a format.

The ideal use case for XML begins with an XML vocabulary that completely
divorces structure from presentation, and forces presentation to be
handled by Some Other Program(tm).  Here, Pod is like HTML, *TeX and
*roff, but only less so.  It's a good lowest common denominator format
that is roughly interchangeable between these other formats.

But Pod's greatest strength is also its greatest weakness:

        =head1 Author and Copyright Information

        Copyright (c) 1997-1999 Tom Christiansen and Nathan Torkington.
        All rights reserved.

        [...]

Or even this:

        =head1 SYNOPSIS

        B<perl> S<[ B<-sTuU> ]> S<[ B<-hv> ] [ B<-V>[:I<configvar>] ]>
            S<[ B<-cw> ] [ B<-d>[:I<debugger>] ] [ B<-D>[I<number/list>] ]>
            S<[ B<-pna> ] [ B<-F>I<pattern> ] [ B<-l>[I<octal>] ] [ B<-0>[I<octal>] ]>
            S<[ B<-I>I<dir> ] [ B<-m>[B<->]I<module> ] [ B<-M>[B<->]I<'module...'> ]>
            S<[ B<-P> ]> S<[ B<-S> ]> S<[ B<-x>[I<dir>] ]>
            S<[ B<-i>[I<extension>] ]> S<[ B<-e> I<'command'> ]
            [ B<--> ] [ I<programfile> ] [ I<argument> ]...>

Keep in mind that Pod is designed to communicate directly to a human
audience via a formatting system of some type.  Ideally, a Huffman coded
XML version of these documents would allow the author to specify:

        - who are the authors?
        - where can they be contacted?
        - when was the document copyrighted?
        - under what terms the document was copyrighted? (GPL, LGPL, AL)
        - what is being protected by this copyright statement? (this
          document?  a module?  a distribution?)

        - what is B<perl> here?  A piece of boldfaced text, or a program
          name?
        - Is this string of text formatted literally, or is it
          something more structured (e.g. a command synopsis typically
          found at the beginning of a manpage)?
        - what is B<-d>?  Is it an operator that tests for the presence
          of directories?  Is it a command line switch?  A boldfaced
          negative d?
        - what do I<configvar> and I<debugger> mean exactly?  Are they
          optional parameters?  Are the preceding colons required when
          they appear?  Italicized words?
        - What exactly is a pattern, as described by B<-F>I<pattern>?
          a regex?  A swatch of cloth?  A literal string?
        - Is B<-hv> one switch or two?  If two, are they always used
          together?  Do they serve similar functions?  Do they serve

DocBook allows most of these questions to be answered explicitly;
formatting these items as bold, plaintext, italics, concatenated into
paragraphs, etc. is necessarily handled by a stylesheet that associates
these formatting properties to specific types of text.  If Pod were to
solve this problem in a Pod-like manner, then it would intuit most of
the answers to these questions much like it intuits literal sections (or
overeagerly intuits "ls(1)" to be shorthand for "the ls(1) manpage").
The grammar would be nasty, the parser would have a lot of special
cases, but the complexity would be where it belongs -- with the parser,
not with the documentation format.

There are two core issues here that have always been conflated in Pod:

        - Pod is a simple authoring syntax; the syntax has been
          oversimplified to better match the problem domain of authoring
          documents, and the complexity has been shoved to the parser
          (rather than the other way around, e.g. XML)

        - Pod is a compact formatting language specifically designed for
          writing documentation that is directly targeted at a human
          audience

It's the second point that's the thorn.  It's also the second point that
gives the 80/20 benefit.  The first point is a holy grail.

> (as I remember Larry once expressing the
> idea, altho I'm quoting from memory, as I can't now find the exact
> message). 

I've heard what Larry has to say about XML, and I don't think he fully
groks the difference between presentational markup and structural
markup.  Pod (the language) simply isn't structural (quibbles about
=head1 vs. =sect1 notwithstanding).  That isn't to say that Pod can't do
a better job, or that Pod can't be extended ever so slightly to hit an
80/20 spot in a much wider domain.  But Pod as defined in perlpod and
perlpodspec is quite simply a formatting language.

So, from here on out, I'll talk about the possibilities of extending the
Pod *syntax* to be a Huffman coding for XML.  This means keeping command
paragraphs as block tags, formatting codes as inline tags, paragraph
parsing, and (possibly) verbatim paragraphs -- and ignoring the semantic
behaviors of =head1, C<>, B<> and the like.

The best idea Larry has blessed so far is the =use clause that
optionally makes Pod behave differently.  That is, the definition of Pod
as a syntax is pretty stable, the definition of the formatting semantics
is sacrosanct, but the definition of new semantics is up for
discussion.  Use of the standard semantics in a new context is similarly
up for discussion.

Here's a proposal (From Ilya, by way of Sean):

> So instead of:
> 
> The <emphasis>destructor</emphasis> (<function>DESTROY</function>) for the
> object <literal>$b</literal> will be called...
> 
> You would have something like:
> 
> =equate M emphasis,B
> 
> =equate U function,C
> 
> =equate T literal,C
> 
> and then anytime later...
> 
> The M<destructor> (U<DESTROY>) for the object T<$b> will be called...

The obvious issue here is that there are only so many uppercase
ASCII characters.
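
To make the =equate idea concrete, here is a minimal sketch, in Python
rather than Perl purely for illustration.  The name expand_equates is
hypothetical, and reading "emphasis,B" as "XML element name, then the
standard Pod code to fall back on" is an assumption about the quoted
proposal.  It handles only non-nested codes and does no XML escaping.

```python
import re

# Hypothetical helper: none of these names come from a real Pod module.
# Assumed directive shape: "=equate <LETTER> <xml-element>,<fallback Pod code>"
EQUATE_RE = re.compile(r"^=equate\s+([A-Z])\s+(\w+),([A-Z])\s*$")
# A single uppercase letter followed by <...>, with no nesting.
CODE_RE = re.compile(r"([A-Z])<([^<>]*)>")


def expand_equates(pod_text):
    """Collect =equate paragraphs, then rewrite the codes they define as XML tags."""
    table = {}  # one-letter code -> XML element name
    body = []
    for para in re.split(r"\n\s*\n", pod_text.strip()):
        m = EQUATE_RE.match(para.strip())
        if m:
            letter, element, _fallback = m.groups()
            table[letter] = element
        else:
            body.append(para)

    def to_tag(match):
        letter, content = match.groups()
        if letter not in table:
            return match.group(0)  # leave undeclared codes such as B<> alone
        return "<{0}>{1}</{0}>".format(table[letter], content)

    return "\n\n".join(CODE_RE.sub(to_tag, p) for p in body)


sample = """=equate M emphasis,B

=equate U function,C

=equate T literal,C

The M<destructor> (U<DESTROY>) for the object T<$b> will be called..."""

print(expand_equates(sample))
```

Every new element costs one of the 26 uppercase letters in the lookup
table, so the limit noted above shows up quickly.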

I wrote up some ideas for TPC5 about an experiment to create a new Pod
language using an extension of basic Pod syntax.  The jump
from m/[A-Z]<+.../ to m/[A-Za-z]{1,2}<+.../ isn't that great.
Furthermore, ln<> or link<> is so much more intuitive and
self-describing than L<>.  (I experimented with multiple spellings of
the same tag name: em<>, emph<>, emphasis<> => <emphasis>...</emphasis>;
lit<>, literal<> => <literal>...</literal>; but 2-4 characters seemed
sufficient, as Larry has said on many occasions.)
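
A rough sketch of that jump, again in Python for illustration only.
The ALIASES table and the expand function are hypothetical, and the
pattern uses [A-Za-z]+ rather than {1,2} so that the longer spellings
mentioned above also match.  Nested codes are not handled.

```python
import re

# Today's Pod: exactly one uppercase letter before the opening bracket.
CLASSIC = re.compile(r"\b([A-Z])<([^<>]*)>")
# The experiment: any short alphabetic name before the opening bracket.
EXTENDED = re.compile(r"\b([A-Za-z]+)<([^<>]*)>")

# Hypothetical alias table: several spellings map to one element name.
ALIASES = {
    "em": "emphasis", "emph": "emphasis", "emphasis": "emphasis",
    "lit": "literal", "literal": "literal",
    "ln": "link", "link": "link",
}


def expand(text):
    """Rewrite known multi-letter codes as XML elements; leave the rest as-is."""
    def to_tag(m):
        name, content = m.groups()
        element = ALIASES.get(name)
        if element is None:
            return m.group(0)  # unknown name: not ours to rewrite
        return "<{0}>{1}</{0}>".format(element, content)

    return EXTENDED.sub(to_tag, text)


print(expand("The em<destructor> for lit<$b>, see ln<perlobj>"))
```

The only change to the tokenizer is the character class; the alias
table is where the self-describing names are paid for.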

The =equate proposal only addresses formatting codes, and surely new
block codes will be necessary to extend beyond =head1 and =begin/=end.
If Pod is to be a Huffman encoding for XML, then we can do much better
than =begin table/=end table "compressing" <table>...</table>.  :-)
Presumably, a similar mechanism could be created to define new command
paragraphs, both "standalone" commands like =head1 and "block" commands
like =begin/=end.

However, this spontaneously recreates the problems of *roff (and perhaps
TeX) that have been solved by SGML and XML.  James Clark wrote groff
because there was no GNU replacement for *roff, and because it seemed
like a good idea at the time.  However, in the process of working so
deeply with the *roff language, James found that conflating the macro
language with the formatting language (or markup language if you prefer)
really makes things quite difficult to maintain, and just isn't as
expressive as it ought to be.  The experience drove him to SGML, and to
write sp/jade/etc.  (He discusses this in a recent interview in Dr. Dobb's.)

Like many XML folks, I trust James implicitly when it comes to markup
languages; if he says that adding a macro facility such as =equate is a
bad idea in a markup language, then it's a Bad Idea(tm).  

James has also been pretty down on XML DTDs of late, but the more I
look at the issue of extending Pod, the more the SGML DTD looks like the
best idea worth stealing.

Consider this:

        - Pod (the syntax) parsers do nothing more than convert a
          document in the Pod syntax into a series of events or a tree
          (the two most popular parsing APIs).

        - Pod (the language) formatters take this series of events or
          tree (possibly validated through a Pod checker to barf on
          illegal constructs such as E<0 1 2 3>) and process these
          events/trees into something else (new Pod documents, HTML,
          spell-checked Pod, word-count summaries, etc.)
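
The split between syntax and language can be sketched as follows
(Python, for illustration only; pod_events is a hypothetical name).
This covers only the syntax half: it classifies paragraphs as command,
verbatim or ordinary text and attaches no meaning to any command name.
It is deliberately simplified, in that it does not skip text before
the first command and does not parse formatting codes.

```python
import re

# "=name optional text", where the text may run over several lines.
COMMAND = re.compile(r"=([A-Za-z]\w*)\s*(.*)", re.S)


def pod_events(text):
    """Yield (kind, name, content) for each blank-line-separated paragraph."""
    for para in re.split(r"\n(?:[ \t]*\n)+", text.strip("\n")):
        if not para.strip():
            continue
        m = COMMAND.match(para)
        if m:
            # Command paragraph: a block-level tag in XML terms.
            yield ("command", m.group(1), m.group(2).strip())
        elif para[0] in " \t":
            # Verbatim paragraph: indented, passed through untouched.
            yield ("verbatim", None, para)
        else:
            # Ordinary paragraph: whitespace is not significant.
            yield ("text", None, " ".join(para.split()))


sample = "=head1 NAME\n\nfoo - does B<things>\n\n    $ foo --help\n\n=cut\n"
for event in pod_events(sample):
    print(event)
```

A formatter or checker would then consume these events and decide what
head1 (or anything else) means.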

Given that, consider:

        - A Pod formatter that comes across a =use clause may load
          a Perl module that contains:

          - code for processing new linguistic constructs (e.g. =list,
            =table, =bibliography)

          - code for formatting them appropriately (?)

Note that the "DTD" in this case is actually Perl code that defines
these new tags/"linguistic constructs", but also contains the code to
process them.  This is similar to the SGML DTD in that a document cannot
be parsed without the document definition, but improves upon it in that
it is not simply a declaration of what is valid, but also the validator
itself.  Using an invalid target in a =use clause is obviously an error; 
support of the =use clause is mandatory for an "extensible Pod"
formatter, but completely optional for a standard Pod formatter (thus
maintaining Pod the formatting language as sacrosanct).
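
A minimal sketch of how an extensible formatter might treat =use, in
Python for illustration only.  In the proposal the target would name a
Perl module to load; here a plain dictionary stands in for that lookup
so the example is self-contained.  VOCABULARIES, PodError and
check_commands are all hypothetical names, and the command list for
p5ee.component is taken from the sample document in this message.

```python
# Commands every Pod formatter already knows (per perlpod/perlpodspec).
STANDARD = {"pod", "cut", "head1", "head2", "head3", "head4",
            "over", "item", "back", "begin", "end", "for", "encoding"}

# Hypothetical stand-in for "load the module named by =use".
VOCABULARIES = {
    "p5ee.component": {"name", "interface", "prereq", "version",
                       "author", "maintainer", "copyright"},
}


class PodError(Exception):
    """Raised when a document uses a construct no loaded definition allows."""


def check_commands(commands):
    """Validate (name, argument) pairs in document order; return the allowed set."""
    allowed = set(STANDARD)
    for name, arg in commands:
        if name == "use":
            if arg not in VOCABULARIES:
                # An invalid =use target is an error, as stated above.
                raise PodError("unknown =use target: " + arg)
            allowed |= VOCABULARIES[arg]
        elif name not in allowed:
            raise PodError("unknown command: =" + name)
    return allowed


doc = [("use", "p5ee.component"), ("name", "ElfinSword::OrcFinder"),
       ("version", "1.0"), ("pod", ""), ("head1", "NAME")]
print(sorted(check_commands(doc) - STANDARD))
```

The definition and the validator live in the same place, which is the
improvement over a purely declarative DTD described above.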

If that's the case, then we can have documents such as this:

        =use p5ee.component

        =name ElfinSword::OrcFinder

        =interface

        [...]

        =prereq ElfinSword ElfinMagic OrcFinder

        =version 1.0

        =author Keebler the Immortal Elf

        =maintainer Frodo Baggins <[EMAIL PROTECTED]>

        =copyright GPL

        =pod

        =head1 ....

No, it's not a simple document.

Yes, it's easy to understand, easy to create, easy to validate, 
easy to process.  Yes, it's more explicit and contains much more
metadata than jrandom.pod *for a specific problem domain*.  

No, it's not an extension of the Pod formatting language, it's a
new language that reuses the Pod syntax.

No, it's not a replacement for DocBook.  Yes it's a replacement for
Carpal-Tunnel-P5EE-ML.  Yes it's better than punting with
        =for xml
        <p5ee-component>...</p5ee-component>

From here, it's a SMOP to add support for other XML constructs, such as
        PIs
        comments
        attributes
        namespaces (for specific problem domains)
        deeply nested blocks
        complex markup

All that's required is a "validator" that recognizes Pod constructs that
map to these XML constructs, as well as the appropriate formatters to
emit them.  As always, the formatter for a hypothetical p5ee Pod format
is an exercise left to the reader, much like the formatter for FooML.
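
For what it's worth, the emitting half can be very small.  The sketch
below (Python, illustration only) turns already-validated command
paragraphs into XML.  The root element name comes from the =for xml
example above; the child element names, the one-element-per-module
treatment of =prereq, and the name component_to_xml are assumptions.

```python
import xml.etree.ElementTree as ET


def component_to_xml(commands):
    """Map (name, argument) command paragraphs onto child elements."""
    root = ET.Element("p5ee-component")
    for name, arg in commands:
        if name == "prereq":
            # Assumed convention: one <prereq> element per listed module.
            for module in arg.split():
                ET.SubElement(root, "prereq").text = module
        else:
            ET.SubElement(root, name).text = arg
    # ElementTree escapes <, > and & in text content for us.
    return ET.tostring(root, encoding="unicode")


doc = [("name", "ElfinSword::OrcFinder"),
       ("prereq", "ElfinSword ElfinMagic OrcFinder"),
       ("version", "1.0"),
       ("copyright", "GPL")]
print(component_to_xml(doc))
```

Attributes, PIs, comments and namespaces would each need their own Pod
spelling before a formatter like this could emit them.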

> While I realize that these are all "problems" not with Pod, but with the
> attempt to allow use Pod as a shorthand for XML.  But like I say, if
> there's some way to kill many birds with one stone (without requiring that
> stone be the size of Ireland, be in hyperspace, and/or be made of
> neutronium), it'd be nice to do, so that we could spread around the
> numminess of Pod!
> 
> Thoughts, anyone?

Yes.  Keep up the good work.

Z.
