I don't think '>>>' could handle the math division notation, but I wrote some code which suggests that PEGs can be extended that way for Python-style indentation just fine. It's attached to this post as 'regpeg.py'.
The indentation construct would need to be extended if you wanted to handle comments without restrictions. It might help to work out some sort of 2-D notation for describing this. Basically it's a block of spaces, extending downwards to cover the whole structure, minus the comments and the places where strings capture the newline.

I think I'll keep doing it with a tokenizer in my own project. I like to be explicit about spacing only when it matters, and that's simple when the input runs through a minimal tokenizer before the parsing expression grammar sees it. It's still worth discussing here, because someone might get a fun idea about how to parse with PEGs in two dimensions.

On Mon, Nov 25, 2013 at 10:19 AM, Dustin Voss <d_...@mac.com> wrote:

> I will also note that this becomes much trickier if you want to parse a
> mathematical expression such as this:
>
> >       y
> > k = ------
> >      x**2
>
> I suppose the only way to do it would actually be to pull out the “>>>”
> rule from within the sequence and make it the wrapper of the entire thing,
> with a “prefix” sub-rule and an “indented” sub-rule, instead of just the
> “indented” sub-rule, e.g., “stmt”. Something like this:
>
>     eqn <- var '=' >>> expression
>
> which gets transformed into something more like this, using a functional
> notation:
>
>     parse-indentation(prefix: sequence(parse-var, parse-text("=")),
>                       indented: parse-expression)
>
> The indentation parser would have to scan all lines to find and parse the
> prefix and verify that it appears alone in the whitespace set aside to the
> left of the expression, and then it would scan the expression lines as you
> describe.
>
> Of course, this limits you to one “>>>” per rule. I also do not see an
> easy syntax to describe whether the prefix part must appear at the top or
> the bottom or the middle of the text. And I’m not sure how the parsed item
> would fit into a stream of other text to the left or right of the
> expression.
>
> On Nov 24, 2013, at 10:31 PM, Henri Tuhola <henri.tuh...@gmail.com> wrote:
>
> > Hi again.
> >
> > You can already parse indentation with PEG by tokenizing step or
> > providing context. But if you treat the input such that it holds two
> > dimensions, shouldn't it be easy to notice that indented block clearly
> > isn't context sensitive after all?
> >
> > for i in range(6):
> >     print(i)
> >     print(i * 2)
> >
> > There is very clear pattern here, and you can't really parse the
> > indentation around the block any other way. So doesn't that mean it can
> > be done with packrat parser? You only need a certain sort of extra rule
> > for it:
> >
> > stmt <- 'for' variable 'in' expression >>> stmt
> >
> > The 'indent' (>>>):
> >
> > 1. Memorize column index as base-indent. Make sure the line starts with
> >    this structure.
> > 2. Match the head pattern.
> > 3. Match newline, count spaces until character found. But skip comments.
> > 4. Fail if less spacing than what column index dictates.
> > 5. Match body pattern.
> > 6. Repeat step 3, 4, 5, until first failure, with condition that the
> >    spacing must line up such that it forms a block.
> >
> > This happens within single block, so it doesn't leak state around. I
> > think it's perhaps possible to synthesize a 2-D PEG. If someone figures
> > out a way to do exactly that, you could also try parse:
> >
> >       y
> > k = ------
> >      x**2
> >
> > or this, if earlier one turns out too insane:
> >
> > k =   y
> >     ------
> >      x**2
> >
> > I read about someone doing parsing on scanned math expressions. So it
> > doesn't sound too impossible to consider that this might work just as
> > well.
> > _______________________________________________
> > PEG mailing list
> > PEG@lists.csail.mit.edu
> > https://lists.csail.mit.edu/mailman/listinfo/peg
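To make the "prefix plus indented part" idea from the quoted messages a bit more concrete, here is a minimal Python sketch. It is not part of regpeg.py and it is not a PEG; it only shows the 2-D bookkeeping involved. The helper name `split_fraction` is hypothetical, and the sketch assumes a single fraction bar made of at least two '-' characters and a prefix that occupies exactly one row in the columns to the left of the bar.

```python
import re


def split_fraction(block):
    """Split a 2-D fraction layout into (prefix, numerator, denominator).

    Hypothetical helper: returns None when the layout does not fit the
    assumptions (one bar of '-' characters, prefix on exactly one row).
    """
    lines = block.splitlines()
    if not lines:
        return None
    # Pad every row to the same width so the text is a proper rectangle.
    width = max(len(line) for line in lines)
    rows = [line.ljust(width) for line in lines]

    # Find the first row that contains a fraction bar.
    for index, row in enumerate(rows):
        bar = re.search(r'-{2,}', row)
        if bar:
            break
    else:
        return None
    left, right = bar.start(), bar.end()

    # The prefix must appear alone in the columns left of the bar,
    # but it may sit on the top, middle or bottom row.
    prefixes = [row[:left].strip() for row in rows if row[:left].strip()]
    if len(prefixes) != 1:
        return None

    def cells(part):
        # Text found between the bar's columns, one piece per row.
        pieces = (row[left:right].strip() for row in part)
        return ' '.join(piece for piece in pieces if piece)

    numerator = cells(rows[:index])
    denominator = cells(rows[index + 1:])
    if not numerator or not denominator:
        return None
    return prefixes[0], numerator, denominator


middle = "      y\nk = ------\n     x**2"
top = "k =   y\n    ------\n     x**2"
print(split_fraction(middle))  # ('k =', 'y', 'x**2')
print(split_fraction(top))     # ('k =', 'y', 'x**2')
```

Both layouts from the thread give the same three pieces, each of which could then be handed to an ordinary one-dimensional parser. The sketch does not address how such a block would sit in a stream of other text to its left or right.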
# regpeg.py
# Every pattern has match(source, pos) -> (new_pos, result);
# new_pos == -1 signals failure.
import re


class Regex(object):
    def __init__(self, pattern, flags=0):
        self.regex = re.compile(pattern, flags)

    def match(self, string, pos):
        res = self.regex.match(string, pos)
        if res is None:
            return -1, None
        result = res.groups()
        # A single capture group is unwrapped for convenience.
        if len(result) == 1:
            return res.end(), result[0]
        else:
            return res.end(), result


class Sequence(object):
    def __init__(self, patterns):
        self.patterns = patterns

    def match(self, source, pos):
        current = pos
        out = []
        for pattern in self.patterns:
            current, result = pattern.match(source, current)
            if current == -1:
                return -1, None
            out.append(result)
        return current, out


class Group(object):
    # Ordered choice: the first alternative that matches wins.
    def __init__(self, patterns):
        self.patterns = patterns

    def match(self, source, pos):
        for pattern in self.patterns:
            current, result = pattern.match(source, pos)
            if current >= 0:
                return current, result
        return -1, None


class Star(object):
    def __init__(self, pattern):
        self.pattern = pattern

    def match(self, source, pos):
        current = pos
        out = []
        last, result = self.pattern.match(source, current)
        while last != -1:
            out.append(result)
            current = last
            last, result = self.pattern.match(source, current)
        return current, out


class Plus(object):
    def __init__(self, pattern):
        self.pattern = pattern

    def match(self, source, pos):
        # Same as Star, but at least one repetition is required.
        current, out = Star(self.pattern).match(source, pos)
        if not out:
            return -1, None
        return current, out


class Indent(object):
    def __init__(self, head, body):
        self.head = head
        self.body = body

    def match(self, source, pos):
        # Column of the head on its own line is the base indentation.
        base = pos - source.rfind('\n', 0, pos) - 1
        out = []
        pos, result = self.head.match(source, pos)
        if pos == -1:
            return -1, None
        out.append(result)
        # The first body line must be indented deeper than the head.
        pos, indent = match_indent(source, pos)
        if indent <= base:
            return -1, None
        cur, result = self.body.match(source, pos)
        if cur == -1:
            return -1, None
        out.append(result)
        # Further body lines must line up with the first one.
        pos, next_indent = match_indent(source, cur)
        while indent == next_indent:
            pos, result = self.body.match(source, pos)
            if pos == -1:
                return -1, None
            cur = pos
            out.append(result)
            pos, next_indent = match_indent(source, pos)
        return cur, out


def match_indent(source, pos):
    # Skip newlines (including blank lines) and count the spaces that
    # start the next non-empty line.
    indent = 0
    if not source.startswith('\n', pos):
        return -1, -1
    while source.startswith('\n', pos):
        indent = 0
        pos += 1
        while source.startswith(' ', pos):
            indent += 1
            pos += 1
    return pos, indent


class Eof(object):
    def match(self, source, pos):
        if len(source) == pos:
            return pos, None  # succeed without consuming anything
        else:
            return -1, None

eof = Eof()


def construct(pattern):
    # Plain strings are shorthand for regular expressions.
    if isinstance(pattern, str):
        return Regex(pattern)
    return pattern


def sequence(*patterns):
    return Sequence([construct(p) for p in patterns])


def group(*patterns):
    return Group([construct(p) for p in patterns])


def star(*patterns):
    if len(patterns) == 1:
        return Star(construct(patterns[0]))
    else:
        return Star(sequence(*patterns))


def plus(*patterns):
    if len(patterns) == 1:
        return Plus(construct(patterns[0]))
    else:
        return Plus(sequence(*patterns))


def parse(pattern, source):
    pos, result = pattern.match(source, 0)
    if pos == -1:
        print('syntax error')
    else:
        print(result)
        print('matched %i characters of %i' % (pos, len(source)))


if __name__ == '__main__':
    source = """
for i in range(6):
    print(i)
    print(i*2)
"""
    spacing = star(r' ')
    variable = sequence(r'([a-z]+)')
    number = sequence(r'([0-9]+)')
    term = group(variable, number)
    sub_expr = group(
        sequence(term, r'(\*)', term),
        term,
    )
    expr = group(
        sequence(term, r'(\()', sub_expr, r'(\))'),
        term
    )
    stmt = Indent(
        sequence(r'(for)', spacing, variable, spacing, r'in', spacing,
                 expr, spacing, ':', spacing),
        expr
    )
    parse(sequence(r'(\n *)*', stmt, r'(\n *)*'), source)
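For comparison with the Indent class above, the tokenizer route mentioned at the top of this message might look roughly like the following. This is only a sketch under stated assumptions, not the tokenizer from any actual project: the name `tokenize_indentation` and the token names LINE, INDENT and DEDENT are hypothetical, only spaces count as indentation, '#' is assumed to start a comment, and strings that capture a newline are not handled.

```python
def tokenize_indentation(source):
    """Turn leading spaces into explicit INDENT/DEDENT tokens.

    Hypothetical helper for illustration; returns a list of
    (kind, value) tuples.
    """
    tokens = []
    stack = [0]  # indentation levels of the blocks that are currently open
    for line in source.splitlines():
        stripped = line.strip()
        # Blank lines and comment-only lines never affect block structure.
        if not stripped or stripped.startswith('#'):
            continue
        indent = len(line) - len(line.lstrip(' '))
        if indent > stack[-1]:
            # Deeper than the current block: open a new one.
            stack.append(indent)
            tokens.append(('INDENT', indent))
        while indent < stack[-1]:
            # Shallower: close blocks until the levels line up again.
            stack.pop()
            tokens.append(('DEDENT', stack[-1]))
        if indent != stack[-1]:
            raise SyntaxError('inconsistent dedent: %r' % line)
        tokens.append(('LINE', stripped))
    # Close whatever is still open at the end of the input.
    while len(stack) > 1:
        stack.pop()
        tokens.append(('DEDENT', stack[-1]))
    return tokens


if __name__ == '__main__':
    example = "for i in range(6):\n    print(i)\n    # a comment\n    print(i * 2)\n"
    for token in tokenize_indentation(example):
        print(token)
```

With the indentation made explicit like this, the grammar itself only has to match INDENT and DEDENT as ordinary tokens, and comments never reach it.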