Grammar Rules For Inset Paragraphs The Haskell rules look quite complex and there are lots of issues that I will ignore, but I hope the essence of the following approach may be helpful to you.
A grammar can specify a simple inset paragraph if it has the ability to specify that the inset of a line matches the `same` inset as the first line of the paragraph. But this ability to specify the `same` match as a previous production is not possible with a CFG or a PEG. There have been various grammar language extension mechanisms and work-arounds used to cope with this problem. The same problem appears in an even more obvious way in the XML grammar where the end tag name must match the `same` name as the start tag. The XML grammar simply leaves this out of the grammar rules, it is a restriction that is pushed out as a semantic check on the parse tree. I have been using a grammar with an operator to extend a PEG or CFG with the ability to match the `same` value as a previous match. The prefix operator notation @x means that the rule must match the `same` value that the previous x rule matched (I'll define this more fully shortly). For example, a simple inset paragraph can be defined as: para = inset line (eol @inset line)* inset = (space | tab)+ line = char+ I have used a generic BNF grammar rule sybtax here, I'll only use mutually exclusive choices in the examples here, so you can translate it into the syntax of a PEG grammar or a CFG grammar as you like. The first inset rule is a normal grammar production, and the @inset matches the `same` match value the previous inset rule matched. The inset has been specified in this case as one or more space or tab, and this requires that the exact `same` inset is used on each line. The issues of tabs, and the tar-pit of mixing them with spaces is legend, but I don't want to get side-tracked into that issue, there is no good answer, and lots of pragmatic design for different applications. It would be really nice if we could insist on only tab characters for insets, but often that is seen as too rigid. Another example of a `same` match is an XML element tag name match: element = '<' tag '>' content '</' @tag '>' tag = name content = (element | text)* The first tag is a normal match, but the second (end tag) is defined as @tag meaning the `same` match that the first (start tag) matched. Notice that the content of the element may contain nested elements, and the @tag has to match the start tag of the same element (not just the last value that any tag matched). The @x operator is specified as a syntax tree reference, the `same` match is defined as the last previous sibling tag node in the (partially generated) parse tree. It is quite easy to implement this in practice since it is just a relative parse tree node value reference (used within the parser), albeit that the parse tree is not yet fully constructed. To fully define the @x operator we need to specify what happens if there is no previous sibling node in the parse tree. This is defined to mean that the @x is an empty match (i.e the empty string '' in the concrete grammar, or the e symbol in the formal grammar). This has the interesting and useful property that a nested paragraphs can now be defined. A nested paragraph has to have a `greater` inset than the current paragraph: para = inset line (eol (@inset line | para))* inset = @inset (space | tab)+ line = char+ The first time the inset rule is applied it will not find an @inset match for a prior inset rule, and will therefore just match one or more space or tab. A subsequent nested paragraph must have a greater inset, since its inset has to match the current inset (the @inset) followed by one or more space or tab. I have used this @x operator in playing around with grammar rules for many years and have found it works quite well for a variety of nested inset paragraph grammars that I have used (and other things). It is ironic that the IETF RFC for the ABNF specification says that ABNF rules may be continued on the next line if they have a greater inset than the first line (of the rule). Of course the ABNF grammar for ABNF could not include this specification! They could have simply insisted that ABNF rules did not start with an inset, and then a continuation line could be defined as an eol followed by one or more tab or space. But the definition of a greater inset is so much better; its just a pity that it can't be specifeied in any CFG (or PEG), and this is often an real grammar deficiency in practice. The notation can be generalised a little further if desired, to allow the `same` previous match to be applied to a child in a previous production, so that @x.y matches the `same` value that the prior child y of x, matched: element = start content end start = '<' tag '>' end = '</' @start.tag '>' tag = name content = (element | text)* The @x.y notation searches back though the parse tree for the first previous x node sibling, or a parents sibling. The match is the value of the child x of this y node. The @x.y notation is quite like an Xpath expression, but I have used @x.y rather than @x/y to avoid confusion with the PEG choice operator (and trying to use an Xpath backwards axis expression for this purpose is very ugly!). As an aside, the method used by the @x operator can be included as a normal parse tree access method that can also be used by an application after the parser has generated the full parse tree. I see that other people have used parameterised rules to define grammar rules with a variable inset size (eg the YAML grammar). But I don't think this appraoch can be implemented nearly as easily. I like my simple @x operator much better :-) Peter. ---- John Meacham <[EMAIL PROTECTED]> wrote: > Hi, I am the author of the jhc haskell compiler as well as the frisby > packrat parsing library for haskell. > > I would like to parse all of haskell via a packrat parser, generated by > pappy, However, Haskell has a 'layout' rule which is causing some > trouble. basically, in haskell, blocks are defined by layout. > > do foo > bar > let x = ... > y = ... > baz > > the 'do' and the 'let' both define a new layout context. I was wondering > if anyone had any ideas on how to extend pappy in order to deal with > this? > > the full specification of the layout rule is here in section 9.3: > > http://haskell.org/onlinereport/syntax-iso.html > > > note that it cannot be carried out purely on the lexical level, as it > depends on matching against a 'parse error' in one of its cases. > > adding a primitive that binds to the current column and consumes no > input would be easy, but the "stack of layout contexts" seems to be a > stumbling block. > > John > > > -- > John Meacham - ⑆repetae.net⑆john⑈ > > _______________________________________________ > PEG mailing list > PEG@lists.csail.mit.edu > https://lists.csail.mit.edu/mailman/listinfo/peg _______________________________________________ PEG mailing list PEG@lists.csail.mit.edu https://lists.csail.mit.edu/mailman/listinfo/peg