On 2007-07-02, Paul McGuire <[EMAIL PROTECTED]> wrote: > On Jul 2, 3:56 pm, Neil Cerutti <[EMAIL PROTECTED]> wrote: > from pyparsing import *
It's always good when your messages start like that. ;) > """ > Ok, here is the step-by-step, beginning with your posted BNF. > (Based on your test cases, I think the '{}'s are really > supposed to be '()'s.) Yeah, the Scheme version of WAE uses curlies in examples just so it looks slightly different from Scheme, although since the Scheme version is built on the Scheme read function, it actually accepts several different kinds of delimiters. Python didn't need to do that, I thought. My first impulse when programming this exercise was to ape the Scheme strategy, going with a syntax analogous to Python's, using Python's code or AST modules. But it turns out I'm not a clever enough language designer. Moreover, the fun of the book I mentioned is in designing the semantics of the programs. The book basically punts parsing, leaving it up to the 'read' function. So basically, Python gets up to speed (except for the define-type and type-case macros) simply by implementing a read with enough functionality for each mini-langauge. > ; <WAE> ::= > ; <num> > ; | { + <WAE> <WAE> } > ; | { - <WAE> <WAE> } > ; | {with {<id> <WAE>} <WAE>} > ; | <id> > > The most basic building blocks in pyparsing are Literal and > Word. With these, you compose "compound" elements using And and > MatchFirst, which are bound to the operators '+' and '|' (on > occasion, Or is required, bound to operator '^', but not for > this simple parser). Since you have a recursive grammar, you > will also need Forward. Whitespace is skipped implicitly. > > Only slightly more advanced is the Group class, which will > impart hierarchy and structure to the results - otherwise, > everything just comes out as one flat list of tokens. You may > be able to remove these in the final parser, depending on your > results after steps 1 and 2 in the "left for the student" part > below, but they are here to help show structure of the parsed > tokens. > > As convenience functions go, I think the most common are oneOf > and delimitedList. oneOf might be useful here if you want to > express id as a single-char variable; otherwise, just use > Word(alphas). > > At this point you should be able to write a parser for this WAE > grammar. Like the following 9-liner: > """ > > LPAR = Literal("(").suppress() > RPAR = Literal(")").suppress() > > wae = Forward() > num = Word(nums) > id = oneOf( list(alphas) ) The above shadows 'id'; I suppose 'ident' would be better. > addwae = Group( LPAR + "+" + wae + wae + RPAR ) > subwae = Group( LPAR + "-" + wae + wae + RPAR ) > withwae = Group( LPAR + "with" + LPAR + id + wae + RPAR + wae + RPAR ) > > wae << (addwae | subwae | withwae | num | id) > > tests = """\ > 3 > (+ 3 4) > (with (x (+ 5 5)) (+ x x))""".splitlines() > > for t in tests: > print t > waeTree = wae.parseString(t) > print waeTree.asList() > print > > """ > If you extract and run this script, here are the results: > 3 > ['3'] > > (+ 3 4) > [['+', '3', '4']] > > (with (x (+ 5 5)) (+ x x)) > [['with', 'x', ['+', '5', '5'], ['+', 'x', 'x']]] How can I make it barf for testcases like '(+ 2 3))'? It doesn't seem to expect an Eof. > Left as an exercise for the student: > 1. Define classes NumWAE, IdWAE, AddWAE, SubWAE, and WithWAE whose > __init__ methods take a ParseResults object named tokens (which you > can treat as a list of tokens), and each with a calc() method to > evaluate them accordingly. > 2. Hook each class to the appropriate WAE class using setParseAction. > Hint: here is one done for you: num.setParseAction(NumWAE) > 3. Modify the test loop to insert an evaluation of the parsed tree. I use doctest, so it looks quite different. On the other hand, it actually checks that the results are correct. ;) > Extra credit: why is id last in the set of alternatives defined > for the wae expression? I'm not sure. When I moved it to first all my valid tests still passed, but one of the deliberately erroneous ones caused a different exception, e.g., '(+ 2)'. Writing my own parser does make error handling more predictable, but probably PyParsing can be configured to do what I want. My AST's from the first version of the program translated easily into your program, with almost no muss or fuss. The muss is that, since all the __init__ functions now expect a token list instead of named arguments, they are now cryptic, and creating AST's manually became annoying. The fuss is that I do have to create one in With's calc function. It should be unnecessary for the AST objects to be so dependent upon the grammar to work correctly. I suppose a solution would be to add another layer of abstraction in between. Read it and weep. The program hasn't changed much. It's still uglier than the Scheme. Attached at the end is my original sexp and WAE parser, for comparison with the PyParsing portion of the program. This time I've included all the tests, but no other useful docstrings. """ A PyParsing solution for WAE, composed by Paul MaGuire, with semantics by Neil. >>> parse('3') <Num 3> >>> parse('(+ 3 45)') <Add <Num 3> <Num 45>> >>> parse('(with (x (+ 5 5)) (+ x x))') <With <Id x> <Add <Num 5> <Num 5>> <Add <Id x> <Id x>>> >>> parse('3').calc() 3 >>> parse('(+ 2)') Traceback (most recent call last): ... ParseException: Expected "(" (at char 4), (line:1, col:5) >>> parse('(+ 3 45)').calc() 48 >>> parse('(- 3 2)').calc() 1 >>> parse('(- (+ 8 4) 4)').calc() 8 >>> parse('(with (x (+ 5 5)) (+ x x))').calc() 20 >>> parse('(with (x 5) (+ 2 x))').calc() 7 >>> parse('(with (x x) x)').calc() Traceback (most recent call last): ... SyntaxError: Free identifier """ import operator from pyparsing import * class Num(object): def __init__(self, tokens): self.number = int(tokens[0]) def __repr__(self): return '<Num %d>' % self.number def subst(self, sub_id, value): return self def calc(self): return self.number class Id(object): def __init__(self, tokens): self.name = tokens[0] def __repr__(self): return '<Id %s>' % self.name def __eq__(self, an_id): return an_id.name == self.name def subst(self, sub_id, value): if sub_id == self: return value else: return self def calc(self): raise SyntaxError('Free identifier') class BinOp(object): def __init__(self, tokens): self.op, self.name = {'+': (operator.add, 'Add'), '-': (operator.sub, 'Sub')}[tokens[0][0]] self.lhs = tokens[0][1] self.rhs = tokens[0][2] def __repr__(self): return '<%s %r %r>' % (self.name, self.lhs, self.rhs) def subst(self, sub_id, value): self.lhs = self.lhs.subst(sub_id, value) self.rhs = self.rhs.subst(sub_id, value) return self def calc(self): return self.op(self.lhs.calc(), self.rhs.calc()) class With(object): def __init__(self, tokens): self.bound_id = tokens[0][1] self.named_expr = tokens[0][2] self.bound_body = tokens[0][3] def __repr__(self): return '<With %s %r %r>' % (self.bound_id, self.named_expr, self.bound_body) def subst(self, sub_id, value): self.named_expr = self.named_expr.subst(sub_id, value) if sub_id != self.bound_id: self.bound_body = self.bound_body.subst(sub_id, value) return self def calc(self): tokens = [self.named_expr.calc()] return self.bound_body.subst(self.bound_id, Num(tokens)).calc() LPAR = Literal("(").suppress() RPAR = Literal(")").suppress() wae = Forward() num = Word(nums).setParseAction(Num) ident = Word(alphas).setParseAction(Id) addwae = Group( LPAR + "+" + wae + wae + RPAR ).setParseAction(BinOp) subwae = Group( LPAR + "-" + wae + wae + RPAR ).setParseAction(BinOp) withwae = Group( LPAR + "with" + LPAR + ident + wae + RPAR + wae + RPAR ).setParseAction(With) wae << (addwae | subwae | withwae | num | ident) def parse(s): return wae.parseString(s).asList()[0] if __name__ == '__main__': import doctest doctest.testmod() Below is my simple sexp parser and wae adapter (most tests and some docs removed for brevity). The hand-written scanner is a result of the horrible performance and readability of the regex-based scanners I experimented with. """ sexp.py A compiler of s-expressions into tuples. Functions provided: read(sexp) Convert a Python str that represents one s-expression into a tuple. For example, >>> read('(+ 4 5)') ('+', 4, 5) >>> read('(+ 4 5 (- 8))') ('+', 4, 5, ('-', 8)) Only s-expressions, symbols, and integers are recognized. Only one expression may appear in the string. Trailing characters result in an exception. >>> read('(first experssion) (second expression)') Traceback (most recent call last): ... ParseError: Expected EOF, got ( EBNF: goal -> [ <expr> ] EOF sexp -> '(' { <expr> } ')' expr -> <sexp> | INTEGER |SYMBOL """ def read(sexp): """ Read an s-expression (in the form of a string), and return a list representing its parse tree. >>> read('') >>> read('400') 400 This parser readly only one expression at a time, unlike a scheme reader. >>> read('23 skidoo') Traceback (most recent call last): ... ParseError: Expected EOF, got SYMBOL >>> read('()') () >>> read('(()') Traceback (most recent call last): ... ParseError: Expected ), got EOF >>> read('(+ 4 5)') ('+', 4, 5) """ return Parser(scan(sexp)).parse() class ParseError(SyntaxError): pass class Parser(object): """ Parser for simple s-expressions. BNF: goal ::= <opt_expr> EOF opt_expr ::= <expr> opt_expr ::= ~ expr ::= <sexp> expr ::= INTEGER expr ::= SYMBOL sexp ::= '(' <opt_expr_list> ')' opt_expr_list ::= <expr_list> opt_expr_list ::= ~ expr_list ::= <expr> <expr_tail> expr_tail ::= <expr> <expr_tail> expr_tail ::= ~ """ def __init__(self, iterable): iterator = iter(iterable) self.token, self.value = iterator.next() self.next = iterator.next def match(self, want): if want != self.token: raise ParseError("Expected %s, got %s" % (want, self.token)) value = self.value if self.token != 'EOF': self.token, self.value = self.next() return value def parse(self): return self.goal() def goal(self): ast = self.opt_expr() self.match('EOF') return ast def opt_expr(self): if self.token == 'EOF': return None else: return self.expr() def opt_expr_list(self): if self.token in ['(', 'INTEGER', 'SYMBOL']: return self.expr_list() else: return () def expr_list(self): return (self.expr(),) + self.expr_tail() def expr_tail(self): if self.token in ['(', 'INTEGER', 'SYMBOL']: return (self.expr(),) + self.expr_tail() else: return () def expr(self): if self.token == '(': return self.sexp() elif self.token == 'INTEGER': return self.match('INTEGER') elif self.token == 'SYMBOL': return self.match('SYMBOL') def sexp(self): self.match('(') ast = self.opt_expr_list() self.match(')') return ast # EOF is the end of the input stream. # INTEGER is '[0-9]+' # SYMBOL is '[^0-9][^ ]*' def scan(sexp): """ Yield the tokens in sexp, in the order they are encountered. >>> for token in scan('()'): ... print token ('(', '(') (')', ')') ('EOF', 'EOF') >>> for token in scan('(+ 5 3)'): ... print token ('(', '(') ('SYMBOL', '+') ('INTEGER', 5) ('INTEGER', 3) (')', ')') ('EOF', 'EOF') >>> for token in scan('12a'): ... print token Traceback (most recent call last): ... ParseError: Ill-formed INTEGER """ ix = 0 length = len(sexp) while ix < length: if sexp[ix] in '()': yield (sexp[ix], sexp[ix]) ix += 1 elif sexp[ix] in '0123456789': jx = ix+1 while jx < length and sexp[jx] in '0123456789': jx += 1 if jx < length and sexp[jx] not in ' \t\n()': raise ParseError("Ill-formed INTEGER") yield ('INTEGER', int(sexp[ix:jx])) ix = jx elif sexp[ix] not in ' \t\n': jx = ix+1 while jx < length and sexp[jx] not in ' \t\n)(': jx += 1 yield ('SYMBOL', sexp[ix:jx]) ix = jx else: ix += 1 yield ('EOF', 'EOF') if __name__ == '__main__': import doctest doctest.testmod() And finally, the WAE adapter, which is not necessarily required by the PyParsing solution. def parse(expr): """ Compile a string representing an WAE expression into an object that represents the abstract syntax tree of that expression. The AST provides a calc method that returns the result of evaluating the expression. >>> parse('(+ 4 5)') <Add <Num 4> <Num 5>> """ def parse_ast(ast): if isinstance(ast, int): return Num(ast) elif isinstance(ast, str): return Id(ast) elif isinstance(ast, tuple): op = ast[0] if op in ['+', '-'] and len(ast) == 3: if op == '+': return Add(parse_ast(ast[1]), parse_ast(ast[2])) elif op == '-': return Sub(parse_ast(ast[1]), parse_ast(ast[2])) elif op == 'with' and len(ast) == 3 and len(ast[1]) == 2: return With(ast[1][0], parse_ast(ast[1][1]), parse_ast(ast[2])) raise SyntaxError("Ill-formed expression") return parse_ast(sexp.read(expr)) The isinstance calls are quite unfortunate compared to type-case. For example, type-case will signal an error if I haven't handled all possible cases. -- Neil Cerutti -- http://mail.python.org/mailman/listinfo/python-list