digression into SGML CM -> RE

Sean M. Burke Sat, 18 Dec 1999 11:54:36 -0800
At 06:23 PM 1999-12-16 -0800, Randal L. Schwartz wrote:
>>>>>> "Sean" == Sean M Burke <[EMAIL PROTECTED]> writes:
>Sean> I was working on a CM-to-regexp translator, which I think I have
>Sean> working right, by the way -- for XML content models, that is -- SGML
>Sean> CMs permit the & operator, which has no straightforward equivalent in
>Sean> regexp.  This was so HTML::AsSubs could do runtime checking of content
>Sean> models.  (A feature I've not yet gotten around to adding.)
>
>I don't know what & does in an SGML CM, but if you need an "and"
>in a regex, just use zero-width positive lookahead:
>
>        / (?= .*? foo) (?= .*? bar) /x
>
>True iff there is both a foo and a bar somewhere in the string.

Hm, I never even thought about using those.  I'm amazed how often I
completely "forget" about some of the features of regexp.

Here, for the ed(ement)ification of the list, is some paste from
http://www.w3.org/TR/html40/intro/sgmltut.html#h-3.3.3.1

[start quote]
Content model definitions 

The content model describes what may be contained by an instance of an
element type. Content model definitions may include:

The names of allowed or forbidden element types (e.g., the UL element
contains instances of the LI element type, and the P element type may not
contain other P elements).  DTD entities (e.g., the LABEL element contains
instances of the "%inline;" parameter entity).

Document text (indicated by the SGML construct "#PCDATA"). Text may contain
character references. Recall that these begin with & and end with a
semicolon (e.g., "Herg&eacute;'s adventures of Tintin" contains the
character entity reference for the "e acute" character).

The content model of an element is specified with the following syntax.
Please note that the list below is a simplification of the full SGML syntax
rules and does not address, e.g., precedences.

( ... ) 
          Delimits a group.   [for grouping, not capturing --SMB]
A 
          A must occur, one time only. 
A+ 
          A must occur one or more times. 
A? 
          A must occur zero or one time. 
A* 
          A may occur zero or more times. 
+(A) 
          A may occur. 
-(A) 
          A must not occur. 
A | B 
          Either A or B must occur, but not both. 
A , B 
          Both A and B must occur, in that order. 
A & B 
          Both A and B must occur, in any order. 

Here are some examples from the HTML DTD:

            <!ELEMENT UL - - (LI)+>

The UL element must contain one or more LI elements.

            <!ELEMENT DL    - - (DT|DD)+>

The DL element must contain one or more DT or DD elements in any order.

            <!ELEMENT OPTION - O (#PCDATA)>

The OPTION element may only contain text and entities, such as &amp; --
this is indicated by the SGML data type #PCDATA.

A few HTML element types use an additional SGML feature to exclude elements
from their content model. Excluded elements are preceded by a hyphen.
Explicit exclusions override permitted elements.

In this example, the -(A) signifies that the element A cannot appear in
another A element (i.e., anchors may not be nested).

            <!ELEMENT A - - (%inline;)* -(A)>
[where %inline; is a private entity for "STRONG,EM,A,IMG,..." --SMB]

Note that the A element type is part of the DTD parameter entity
"%inline;", but is excluded explicitly because of -(A).

Similarly, the following element type declaration for FORM prohibits nested
forms:

            <!ELEMENT FORM - - (%block;|SCRIPT)+ -(FORM)>

[end long quote]


Now, first off, much of what the above says describes features that aren't
in XML.  I'm no expert on this, but I think that what's missing from XML is
the '&' operator, and the whole +(A) or -(A) business.  So an XML content
model really looks a lot like a RE (except that characters are tokens in
REs, but in CMs, it's elements that are tokens).
There's also a requirement I'm not sure I understand, but I think means
that the CM-matcher-engine shouldn't ever have to backtrack.  I.e.,
zaz,((foo,bar)|(foo,baz)) is no good.  Because if it's trying to match
against a content string starting out "zaz foo...", it would have to try
matching the next bit against (foo,bar), and if that fails, to backtrack
and match against (foo,baz).

But anyway, on the off chance anyone might see a way to implement '&', here
goes:

The CM "foo & bar & baz" will match the element series consisting only of
those three elements, once, in any order: "foo bar baz", "foo baz bar",
"bar foo baz", "bar baz foo", "baz foo bar", and "baz bar foo".  I believe
the items in a & series can be grouped expressions: "foo & (bar)+ & baz";
altho I think the interpretation of this is a bit counterintuitive: that
matches "bar bar bar bar foo baz" but /not/ "bar bar foo bar bar baz".
I've also gotten the impression that this can react a bit strangely with
the no-backtracking rule, so that I believe "(a,b)|(x & y & a)" is bad, and
I'm not even sure what to do with "(a*,b)|(x & y & a*)", since I'm not
entirely sure how & would be implemented in an RE-like engine, much less
how it would work with that engine's concept of when to backtrack.
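
For the narrow case where every item in the &-series is a plain, distinct
element name, the lookahead suggestion quoted at the top can be combined with
a token count. What follows is only a sketch under assumptions: the helper
name and_series_re is made up, content is assumed to be flattened to a string
of <tag> tokens such as "<bar><foo><baz>", and it does nothing for grouped
items like (bar)+.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): build a regex for
# "a & b & c" where every operand is a plain, distinct element name.
# Content is assumed to be a string of <tag> tokens.
sub and_series_re {
    my (@names) = @_;
    my $token = '<[^<>]+>';    # one delimited token
    my $re    = '';
    for my $name (@names) {
        # each lookahead asserts that this name occurs somewhere
        $re .= '(?=(?:' . $token . ')*?<' . quotemeta($name) . '>)';
    }
    my $count = scalar @names;
    # exactly as many tokens as names: no repeats, nothing extra
    $re .= "(?:$token){$count}";
    return qr/\A$re\z/;
}

my $re = and_series_re(qw(foo bar baz));
for my $content ('<bar><foo><baz>', '<foo><foo><baz>', '<foo><bar><baz><baz>') {
    print "$content: ", ($content =~ $re ? "matches" : "no match"), "\n";
}
```

The lookaheads say "each name is present" and the counted group says "and
there are only that many tokens", which together force a permutation.  It
stops working as soon as an operand can repeat or be optional.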

Anyhow, I've not heard anything forbidding the &-series from being embedded
in a larger expression, so I believe this is valid:

 CM for element "rant":
                     (
                      invective+
                      |
                      (accusation & exculpation & mocking+)
                      |
                      (fingerpointing, hotair*)
                     )

Now, to CM-to-RE translation:  given a simple XML CM, like this CM for
XHTML "table":
  CM:  (caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))
Suppose we have an actual HTML::Element object in $table, and want to check
it.  I think one could do /something/ like:

my $table_content =
   join '',   # without the join, map in scalar context just returns a count
   map {
     # here using <> just to delimit tokens
     if(ref) {
       '<' . $_->tag . '>';
     } else { # text segment
       '<PCDATA>'; # or whatever
     }
   }
   @{$table->content || []}
;
# now $table_content is like "<thead><tr><tr><tr><tr><tr>"
print "$table_content is OK!\n" if $table_content =~ m/$table_cm_re/;


Where the RE should be something we cook up like so:
CM:  (   caption  ? , (   col  * |   colgroup  * ),   thead  ?,  tfoot  ? ...
RE: ^( (<caption>)?   ( (<col>)* | (<colgroup>)* )  (<thead>)? (<tfoot>)? ...

CM continued:  (  tbody  + |   tr  +)  )
RE continued:  ((<tbody>)+ | (<tr>)+)  )$

(Actually, the parens in the RE need only be (?: ... ), but it doesn't
matter.)
So, iff $table_content matches this:
 m/^
   (
     (<caption>)?
     (
       (<col>)* | (<colgroup>)*
     )
     (<thead>)?
     (<tfoot>)?
     (
       (<tbody>)+ | (<tr>)+
     )
   )
   $
  /xs
then it's good.  And I've got a fairly trivial function that, given a valid
XML CM string, will always output a valid RE.  It doesn't do inclusions or
exclusions (which are another, minor, topic), and it doesn't do &.
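
To make the idea concrete, here is a minimal sketch of such a translator.
It is not the function mentioned above; the name cm_to_re and the <PCDATA>
token are assumptions for illustration.  It assumes the CM is already valid,
uses (?: ... ) for grouping, and simply dies on anything it doesn't know,
which includes &.

```perl
use strict;
use warnings;

# Hypothetical translator: turn an XML-style content model into a regex
# over "<tag>" token strings.  Assumes the CM is valid; '&' and
# inclusions/exclusions are rejected rather than translated.
sub cm_to_re {
    my ($cm) = @_;
    my $re = '';
    while ($cm =~ /\G\s*(?:([A-Za-z_][\w.\-]*)|(\#PCDATA)|([()|?*+,]))/gc) {
        if    (defined $1) { $re .= '(?:<' . quotemeta($1) . '>)' } # one element token
        elsif (defined $2) { $re .= '(?:<PCDATA>)' }                # text segment token
        elsif ($3 eq '(')  { $re .= '(?:' }                         # group, don't capture
        elsif ($3 eq ',')  { }                                      # sequence = concatenation
        else               { $re .= $3 }                            # ) | ? * + carry over
    }
    # anything left over (other than whitespace) is something we can't handle
    $cm =~ /\G\s*\z/gc
        or die "unsupported content model near: " . substr($cm, pos($cm) // 0) . "\n";
    return qr/\A(?:$re)\z/;
}

my $table_cm_re =
    cm_to_re('(caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))');
for my $content ('<thead><tr><tr><tr>', '<tr><thead>') {
    print "$content: ", ($content =~ $table_cm_re ? "OK" : "not OK"), "\n";
}
```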

Now, my first hack at & was to turn this:
  CM:  fish & fowl
into this:
  CM:  (fish,fowl)|(fowl,fish)
And therefore, recurrendo, to turn this:
  CM:  foo & bar & baz
Into this:
  CM:
       (foo,
         ((bar,baz)|(baz,bar))
       )
       |
       (bar,
         ((foo,baz)|(baz,foo))
       )
       |
       (baz,
         ((bar,foo)|(foo,bar))
       )

However, the size of the resulting expression obviously grows insanely fast,
so that "tansut & spaceghost & guest & announcer & brak & zorak & moltar &
herculoid+ & jan? & jace+ & cyclo & drnightmare" (which is perfectly
conceivable as an SGML CM) is quite book-length.
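
A rough sketch of that expansion, and of how fast it grows, with made-up
helper names (expand_and and expansion_names are not from any module).
Operands are treated as opaque strings, so "herculoid+" is just passed
through.  A twelve-item series has 12! = 479,001,600 orderings to cover.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite an &-series as an equivalent CM that uses
# only ',' and '|', by picking each operand first and recursing on the rest.
sub expand_and {
    my (@items) = @_;
    return $items[0] if @items == 1;
    my @alts;
    for my $i (0 .. $#items) {
        my @rest = @items;
        my ($first) = splice @rest, $i, 1;    # remove the i-th operand
        push @alts, "($first," . expand_and(@rest) . ')';
    }
    return '(' . join('|', @alts) . ')';
}

# How many operand names the expansion of an n-item series contains:
# T(1) = 1, T(n) = n * (1 + T(n-1)).
sub expansion_names {
    my ($n) = @_;
    my $t = 1;
    $t = $_ * (1 + $t) for 2 .. $n;
    return $t;
}

print expand_and(qw(fish fowl)), "\n";
print "$_ operands: ", expansion_names($_), " names\n" for 2, 3, 6, 12;
```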

Now, can anyone else see a better way to implement &?

I can imagine a way to implement & with embedded code-in-RE; or one could
simply write a simple RE engine, cf. Mark-Jason Dominus's Regexp.pm from a
few TPJ issues ago.  But I think both are less desirable than a way that
would spit out a pure RE, given any CM string.

--
Sean M. Burke [EMAIL PROTECTED] http://www.netadventure.net/~sburke/