Re: [PEG] Easy way to parse indented syntax by adding dimension?

Henri Tuhola Mon, 25 Nov 2013 02:04:32 -0800

I don't think '>>>' would be able to handle the math division notation,
but I wrote some code suggesting that PEGs can be extended this way to
handle Python-style indentation just fine. It's attached to this post as
'regpeg.py'.


The indentation construct would need to be extended if you wanted to handle
comments without restrictions. It might help to work out some sort of
2D notation for describing this. Basically it's a block of spaces
extending downwards to cover the whole structure, minus the comments and
the places where strings capture the newline.
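
As a rough illustration of the comment handling described above, here is a
sketch of a comment-aware variant of the match_indent helper from the
attachment. The function name, the comment_char parameter, and the choice
of '#' as comment marker are assumptions made for this example. It only
skips blank and comment-only lines; comments trailing code and strings
that capture a newline are not handled.

```python
def match_indent_skipping_comments(source, pos, comment_char='#'):
    """Hypothetical variant of match_indent that ignores blank and
    comment-only lines.

    Returns (position of the first significant character, its column),
    or (-1, -1) if pos is not at a newline or no code line follows.
    """
    if not source.startswith('\n', pos):
        return -1, -1
    while source.startswith('\n', pos):
        pos += 1
        indent = 0
        # Count the leading spaces of this line.
        while source.startswith(' ', pos):
            indent += 1
            pos += 1
        if source.startswith(comment_char, pos):
            # Comment-only line: jump to its end and look at the next line.
            end = source.find('\n', pos)
            pos = len(source) if end == -1 else end
        elif pos < len(source) and not source.startswith('\n', pos):
            # First line with real content decides the indentation.
            return pos, indent
    return -1, -1
```

Unlike the original helper, this one reports failure when only whitespace
or comments remain, so a caller comparing indents simply sees no further
block line.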

I think I'll keep doing it with a tokenizer in my own project. I like to be
explicit with spacing only when it matters, and that's simple when the
strings run through a minimal tokenizer before the parsing expression
grammar sees them. It's still worth discussing here because someone might
get a fun idea about how to parse with PEGs in two dimensions.
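
For comparison, a minimal sketch of the tokenizer route. This is not the
tokenizer from my project; the tokenize function, the WORD pattern, and
the '<INDENT>', '<DEDENT>' and '<NEWLINE>' markers are all assumptions
chosen for illustration, following the Python-style convention of turning
layout into explicit tokens so the grammar never sees raw spacing.

```python
import re

# Identifiers, integers, or any single non-space character.
WORD = re.compile(r'[A-Za-z_]\w*|\d+|\S')

def tokenize(source):
    """Return word tokens plus synthetic layout markers.

    The markers cannot collide with real tokens, because WORD never
    produces a multi-character token starting with '<'.
    """
    tokens = []
    levels = [0]  # stack of currently open indentation columns
    for line in source.splitlines():
        text = line.strip()
        if not text or text.startswith('#'):
            continue  # blank and comment-only lines never affect layout
        column = len(line) - len(line.lstrip(' '))
        if column > levels[-1]:
            levels.append(column)
            tokens.append('<INDENT>')
        while column < levels[-1]:
            levels.pop()
            tokens.append('<DEDENT>')
        tokens.extend(WORD.findall(text))
        tokens.append('<NEWLINE>')
    # Close every block still open at end of input.
    tokens.extend('<DEDENT>' for _ in levels[1:])
    return tokens
```

With this in front, a grammar rule for the loop only has to match
'<INDENT>' stmt+ '<DEDENT>'. The sketch does not check that a dedent
lands on a previously opened column, which a real tokenizer would want
to report as an error.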



On Mon, Nov 25, 2013 at 10:19 AM, Dustin Voss <d_...@mac.com> wrote:

> I will also note that this becomes much trickier if you want to parse a
> mathematical expression such as this:
>
> >       y
> > k = ------
> >     x**2
>
>
> I suppose the only way to do it would actually be to pull out the “>>>”
> rule from within the sequence and make it the wrapper of the entire thing,
> with a “prefix” sub-rule and an “indented” sub-rule, instead of just the
> “indented” sub-rule, e.g., “stmt”. Something like this:
>
> eqn <- var '=' >>> expression
>
> which gets transformed into something more like this, using a functional
> notation:
>
> parse-indentation(prefix: sequence(parse-var, parse-text(“=”)), indented:
> parse-expression)
>
> The indentation parser would have to scan all lines to find and parse the
> prefix and verify that it appears alone in the whitespace set aside to the
> left of the expression, and then it would scan the expression lines as you
> describe.
>
> Of course, this limits you to one “>>>” per rule. I also do not see an
> easy syntax to describe whether the prefix part must appear at the top or
> the bottom or the middle of the text. And I’m not sure how the parsed item
> would fit into a stream of other text to the left or right of the
> expression.
>
> On Nov 24, 2013, at 10:31 PM, Henri Tuhola <henri.tuh...@gmail.com> wrote:
>
> > Hi again.
> >
> > You can already parse indentation with PEG by adding a tokenizing step
> > or providing context. But if you treat the input as having two
> > dimensions, shouldn't it be easy to notice that an indented block
> > clearly isn't context sensitive after all?
> >
> > for i in range(6):
> >     print(i)
> >     print(i * 2)
> >
> > There is a very clear pattern here, and you can't really parse the
> > indentation around the block any other way. So doesn't that mean it can
> > be done with a packrat parser? You only need a certain sort of extra
> > rule for it:
> >
> > stmt <- 'for' variable 'in' expression >>> stmt
> >
> > The 'indent' (>>>):
> >
> >  1. Memorize the column index as base-indent. Make sure the line starts
> >     with this structure.
> >  2. Match the head pattern.
> >  3. Match newline, count spaces until a character is found. But skip
> >     comments.
> >  4. Fail if there is less spacing than the column index dictates.
> >  5. Match the body pattern.
> >  6. Repeat steps 3, 4, 5 until the first failure, with the condition
> >     that the spacing must line up such that it forms a block.
> >
> > This happens within a single block, so it doesn't leak state around. I
> > think it's perhaps possible to synthesize a 2-D PEG. If someone figures
> > out a way to do exactly that, you could also try to parse:
> >
> >        y
> > k = ------
> >      x**2
> >
> > or this, if the earlier one turns out too insane:
> >
> > k = y
> >      ------
> >      x**2
> >
> > I read about someone doing parsing on scanned math expressions. So it
> > doesn't sound too impossible to consider that this might work just as
> > well.
> > _______________________________________________
> > PEG mailing list
> > PEG@lists.csail.mit.edu
> > https://lists.csail.mit.edu/mailman/listinfo/peg
>
>

import re

class Regex(object):
    def __init__(self, pattern, flags=0):
        self.regex = re.compile(pattern, flags)

    def match(self, string, pos):
        res = self.regex.match(string, pos)
        if res is None:
            return -1, None
        result = res.groups()
        if len(result) == 1:
            return res.end(), result[0]
        else:
            return res.end(), result

class Sequence(object):
    def __init__(self, patterns):
        self.patterns = patterns

    def match(self, source, pos):
        current = pos
        out = []
        for pattern in self.patterns:
            current, result = pattern.match(source, current)
            if current == -1:
                return -1, None
            out.append(result)
        return current, out

class Group(object):
    def __init__(self, patterns):
        self.patterns = patterns

    def match(self, source, pos):
        for pattern in self.patterns:
            current, result = pattern.match(source, pos)
            if current >= 0:
                return current, result
        return -1, None


class Star(object):
    def __init__(self, pattern):
        self.pattern = pattern

    def match(self, source, pos):
        current = pos
        out = []
        last, result = self.pattern.match(source, current)
        while last != -1:
            out.append(result)
            current = last
            last, result = self.pattern.match(source, current)
        return current, out

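class Plus(object):
    def __init__(self, pattern):
        self.pattern = pattern

    def match(self, source, pos):
        # One or more repetitions: the first match is mandatory.
        current, result = self.pattern.match(source, pos)
        if current == -1:
            return -1, None
        out = [result]
        last, result = self.pattern.match(source, current)
        while last != -1:
            out.append(result)
            current = last
            last, result = self.pattern.match(source, current)
        return current, out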

class Indent(object):
    def __init__(self, head, body):
        self.head = head
        self.body = body

    def match(self, source, pos):
        base = pos - source.rfind('\n', 0, pos) - 1
        out = []
        pos, result = self.head.match(source, pos)
        if pos == -1:
            return -1, None
        out.append(result)
        pos, indent = match_indent(source, pos)
        if indent <= base:
            return -1, None
        cur, result = self.body.match(source, pos)
        if cur == -1:
            return -1, None
        out.append(result)
        pos, next_indent = match_indent(source, cur)
        while indent == next_indent:
            pos, result = self.body.match(source, pos)
            if pos == -1:
                return -1, None
            cur = pos
            out.append(result)
            pos, next_indent = match_indent(source, pos)
        return cur, out
        
def match_indent(source, pos):
    indent = 0
    if not source.startswith('\n', pos):
        return -1, -1
    while source.startswith('\n', pos):
        indent = 0
        pos += 1
        while source.startswith(' ', pos):
            indent += 1
            pos += 1
    return pos, indent



class Eof(object):
    def match(self, source, pos):
        if len(source) == pos:
            # Succeed without consuming anything; keep the current position.
            return pos, None
        else:
            return -1, None

eof = Eof()

def construct(pattern):
    # Plain strings are shorthand for regular expressions.
    if isinstance(pattern, str):
        return Regex(pattern)
    return pattern

def sequence(*patterns):
    # Build real lists: a bare map() iterator would be exhausted after
    # the first match attempt on Python 3.
    return Sequence([construct(p) for p in patterns])

def group(*patterns):
    return Group([construct(p) for p in patterns])

def star(*patterns):
    if len(patterns) == 1:
        return Star(construct(patterns[0]))
    else:
        return Star(sequence(*patterns))

def plus(*patterns):
    if len(patterns) == 1:
        return Plus(construct(patterns[0]))
    else:
        return Plus(sequence(*patterns))

def parse(pattern, source):
    pos, result = pattern.match(source, 0)
    if pos == -1:
        print('syntax error')
    else:
        print(result)
        print('matched %i characters of %i' % (pos, len(source)))

if __name__=='__main__':
    source = """
    for i in range(6):
        print(i)
        print(i*2)
    """


    spacing = star(r' ')

    variable = sequence(r'([a-z]+)')
    number = sequence(r'([0-9]+)')

    term = group(variable, number)

    sub_expr = group(
        sequence(term, r'(\*)', term),
        term,
    )

    expr = group(
        sequence(term, r'(\()', sub_expr, r'(\))'),
        term
    )

    stmt = Indent(
        sequence(r'(for)', spacing, variable, spacing, r'in', spacing, expr, spacing, ':', spacing),
        expr
    )

    parse(sequence(r'(\n *)*', stmt, r'(\n *)*'), source)

