Re: Parsing indent-sensitive languages

Peri Hankey Fri, 09 Sep 2005 02:41:35 -0700

Dave Whipp wrote:

If I want to parse a language that is sensitive to whitespace indentation (e.g. Python, Haskell), how do I do it using P6 rules/grammars?
The way I'd usually handle it is to have a lexer that examines leading whitespace and converts it into "indent" and "unindent" tokens. The grammar can then use these tokens in the same way that it would any other block-delimiter.
This requires a stateful lexer, because to work out the number of "unindent" tokens on a line, it needs to know what the indentation positions are. How would I write a P6 rule that defines <indent> and <unindent> tokens? Alternatively (if a different approach is needed) how would I use P6 to parse such a language?
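
For readers who have not met the technique described in the question, here is a minimal sketch of such a stateful lexer. It is written in Python rather than P6 rules, and it is only an illustration: the function name indent_tokens and the token names are hypothetical, and it assumes indentation is made of spaces only (no tabs). The stack of open indentation widths is the state that lets it work out how many "unindent" tokens a line needs.

```python
def indent_tokens(lines):
    """Yield ("INDENT",), ("UNINDENT",) and ("LINE", text) tokens.

    Assumption: indentation uses spaces only; tabs are not handled.
    """
    stack = [0]  # indentation widths of the blocks currently open
    for raw in lines:
        text = raw.lstrip(" ")
        if not text.strip():
            continue  # blank lines do not affect block structure
        width = len(raw) - len(text)
        if width > stack[-1]:
            # deeper than the current block: open a new one
            stack.append(width)
            yield ("INDENT",)
        while width < stack[-1]:
            # shallower: close blocks until we are back at a known width
            stack.pop()
            yield ("UNINDENT",)
        if width != stack[-1]:
            raise ValueError(f"inconsistent indentation: {raw!r}")
        yield ("LINE", text.rstrip())
    # close every block still open at end of input
    while len(stack) > 1:
        stack.pop()
        yield ("UNINDENT",)
```

A grammar can then treat the INDENT and UNINDENT tokens like any other pair of block delimiters.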

In this context, I thought readers of this list might be interested in the following extract from mediawiki.lmn, a ruleset for generating html pages from a subset of mediawiki markup. These rules are written in lmn, the metalanguage of the language machine, and the extract deals with unordered and ordered lists, where entries are prefixed by '*' and '#' characters, and repeated prefix characters indicate nesting.

NB the source text of lmn rules is written using a subset of the mediawiki markup, with preformatted text (lines that start with at least one space) treated as actual source with no markup and everything else treated as annotation:


----------------- start of extract from mediawiki.lmn ------------------
== bulleted and numbered lists ==

Unordered and ordered lists are a bit tricky - essentially they are like indented blocks in Python, but a little more complex because of the way ordered and unordered lists can be combined with each other. The solution is that at each level, the prefix pattern of '#' and '*' characters is known, and the level continues while that pattern is recognised. This can be done by matching the value of a variable which holds the pattern for the current level.


    '*'                                  <- unit - ulist :'*';
    '#'                                  <- unit - olist :'#';
    ulist :A item :X repeat more item :Y <- unit ul :{X each Y} eom;
    olist :A item :X repeat more item :Y <- unit ol :{X each Y} eom;

    '*'                                  <- item - ulist :{A'*'};
    '#'                                  <- item - olist :{A'#'};
    ulist :A item :X repeat more item :Y <- item :{ ul :{X each Y}};
    olist :A item :X repeat more item :Y <- item :{ ol :{X each Y}};
    - wikitext :X                        <- item :{ li :X };

The following rule permits a level to continue as long as the input matches the current prefix. We recurse for each level before getting here, so we will always try to match the innermost levels first - they have the longest prefix strings, and so there is no danger of a premature match.


    - A                                  <- more ;
-----------------  end of extract from mediawiki.lmn  ------------------
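
For anyone who does not read lmn, the same prefix-matching idea can be sketched in Python. This is not a translation of the rules above, only a rough illustration under some assumptions: the function name parse_lists and the (tag, children) tuple shape are hypothetical, every input line is taken to be a list entry, and the item text is kept as a plain string. As in the extract, each level knows its own prefix, a nested list extends that prefix by one character, and a level continues only while the input still matches the prefix.

```python
def parse_lists(lines):
    """Convert '*'/'#'-prefixed lines into nested (tag, children) tuples."""
    pos = 0  # index of the next unread line

    def split(line):
        # separate the run of marker characters from the item text
        body = line.lstrip("*#")
        return line[: len(line) - len(body)], body.strip()

    def level(prefix):
        # Parse one list whose own items all carry exactly `prefix`.
        nonlocal pos
        tag = "ul" if prefix[-1] == "*" else "ol"
        items = []
        while pos < len(lines):
            marks, body = split(lines[pos])
            if marks == prefix:
                items.append(("li", body))
                pos += 1
            elif marks.startswith(prefix):
                # longer prefix: recurse one level deeper, extending
                # the current prefix by a single marker character
                items.append(level(marks[: len(prefix) + 1]))
            else:
                break  # prefix no longer matches, so this level ends
        return (tag, items)

    result = []
    while pos < len(lines):
        marks, _ = split(lines[pos])
        if not marks:
            raise ValueError(f"not a list entry: {lines[pos]!r}")
        result.append(level(marks[0]))
    return result
```

Because the recursion reaches the deepest level first, the longest prefix is always tried before any shorter one, which is the point made in the annotation above.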

The complete ruleset can be seen at:
http://languagemachine.sourceforge.net/website.html    - summary
http://languagemachine.sourceforge.net/mediawiki.html  - markup
http://languagemachine.sourceforge.net/sitehtml.html   - wrappings

I have fairly recently published the language machine under GNU GPL at sourceforge. It consists of a minimal main program, a shared library written in D using the gdc frontend to GNU gcc, and several flavours of an lmn metalanguage compiler - these are all written in lmn and share a common frontend.

The metalanguage compiler sources are on the website (with many other examples) as web pages that have been generated directly from lmn source text by applying the markup-to-html translation rules.

The language machine in previous incarnations has a long history, but it is not much like any other language toolkit that I know of. This is a page that relates it to received wisdom about language and language implementations:


http://languagemachine.sourceforge.net/grammar.html

There is an extremely useful diagram which shows what happens when unrestricted grammatical substitution rules are applied to an input stream - this is explained here in relation to a couple of trivially simple examples:


http://languagemachine.sourceforge.net/lm-diagram.html

My intention in creating this implementation has been to make something that can be combined with other free languages and toolchains, and I have recently asked the grants-secretary at the Perl Foundation for feedback on a draft proposal to create a language machine extension for perl.


I would be very interested to hear what you think.

Regards
Peri Hankey

--
http://languagemachine.sourceforge.net - The language machine
