Re: [bitc-dev] FParsec and Parser Combinators more generally

Jonathan S. Shapiro Mon, 18 May 2015 08:43:28 -0700

On Fri, May 15, 2015 at 12:40 PM, Matt Rice <ratm...@gmail.com> wrote:

> On Wed, May 13, 2015 at 7:38 AM, Jonathan S. Shapiro <s...@eros-os.org>
> wrote:
>
> >     type Parser<'TResult, 'TUserState> = CharStream<'TUserState> ->
> > Reply<'TResult>
> >
> > But there is nothing magical about this definition. While the developer
> > *can* introduce new compositing operators, they typically do not do so.
> If
> > we take the list of compositing operators as non-extensible,
>
> I guess i'm not sure how mixfix fits into this...

It doesn't.

Mixfix divides a conventional parser into a static parser and a dynamic
parser. In most cases, the static parser doesn't have ability to deal with
dynamically defined operator precedence, and therefore cannot deal with the
syntax changes introduced dynamically by the mixfix mechanism. The result
is that the static parser merely collects the expression tokens into a
sequence (e.g. a list) and the dynamic parser then produces a parse tree
for the expression.

Now that you have me thinking about it, though, it *would* be possible to
design an operator precedence combinator-style parser in which the actual
precedence of the operators can be expressed dynamically. I had not
previously considered this.

I'm initially hesitant, for two mundane reasons and one serious one. The
mundane reasons:

1. Making this all work in BitC required introducing some special-case
mixfix operators into the master mixfix table so that "built in" expression
syntax (mainly application) could be handled in the dynamic parser. I'm not
sure how that plays out if we do it with a combinator. Probably easy to
check.

2. I already have code that does what I think I want (from the v0
compiler), and there is a degree of "if it ain't broke..." in my mind.
Also, I'm not very experienced with combinators and I don't want to screw
this up. I should probably look at some existing operator precedence
combinators and see if the MixFix engine can be expressed that way.

The more serious concern:

If we merge the static and dynamic parsers by building a custom combinator,
we lose the ability to analyze the grammar for parse ambiguities.

[Late addition: I now see how to avoid this issue, and I'm pretty convinced
that I can straightforwardly turn the existing v0 mixfix parser into a
custom combinator. In fact, turning it into a combinator in the right way
resolves the "can't analyze" issue.]

The ambiguity analysis issue isn't just a mundane issue. An unambiguous
grammar guarantees a linear parse, where a general CFG is O(n^3). That's a
difference that actually matters in practice. One of the weaknesses of the
current BitC mixfix parser is that it uses backtracking and is therefore
[potentially] O(n^3). I actually did some work in there to prune the search
space, but I need to do some more work so that AST tree rearrangement in
the presence of certain precedence combinations doesn't necessitate
backtracking. The necessary restructuring is conceptually similar to what
happens with balanced binary trees, but with a different ordering criteria.
IIRC, the necessary tree rearrangement is actually described in the early
operator precedence parsing papers. I think by Aho, but if not we can use
some of the ideas that Vaughan Pratt used in top-down operator precedence
parsing. It wasn't worth the bother for a quick implementation, and what
with expressions being generally very short I wasn't that worried about it
in practice for a v0 compiler.

Optimizing mixfix using conventional optimizations seems a tricky business.
Brzozozwksi derivatives can be computed incrementally (which is in fact how
many modern regexp engines work). They have an extension to CFG's (Might,
Derais, and Spiewak), and it *appears* to me that the extension can be made
to operate incrementally. Especially in the special case of mixfix, where
we know by construction that productions are never nullable. The catch is
that the set of productions introduced by mixfix is constantly changing,
and every time they change you need to revise any derivatives that you have
already built. The simplest thing is to just flush the derivative cache,
but at that point you've thrown away most of your optimization.

So it doesn't seem at all clear what to do about optimizing that, nor is it
clear that it's worth bothering. Might make a good undergraduate project
for someone, since it's fairly well self-contained.

Whether we optimize it or not, the presence of mixfix introduces a black
box into the parse ambiguity analysis. We can do a meta-analysis to show
that the sequence of tokens that participate in a mixfix sub-parse can be
captured without ambiguity. The existing v0 parser does that already. But
we then have to assert a leap of faith that the operator precedence parser
is linear time (which we know it can be, so maybe that's not bad).

...it seems like with
> mixfix you'd want to check for ambiguity at the time the parser is
> compiled, but then also at the time when the mixfix operators have
> been collected?  does mixfix not fall into this realm of /they
> typically do not do so/, or maybe its just a case of reserved words
> are used in such a way that mixfix cannot introduce an ambiguity

Mixfix cannot introduce ambiguity, but the reason isn't immediately
obvious. Here are the pieces:

t. The expression sub-grammar is completely partitioned from the
surrounding grammar, and (by design) there is no overlap in the associated
keywords. Therefore, the main grammar can be examined for ambiguity by
viewing the expression non-terminal as "obtained by magic".

2. The mixfix introduction form is (by construction) a fully bracketed
expression. This means that the operator precedence table is fixed for the
duration of any given expression subtree. Which in turn means that you
don't need to consider anything weird like precedence changing on the fly
or keywords being rebound on the fly in mid-expression. That's actually
pretty important. :-)

3. The mixfix sub-grammars are (a) non-nullable, (b) have (by construction)
non-overlapping keywords. The only way you can really get in trouble with
them is to have two mixfix definitions with identical precedence. That's
generally a design mistake on the part of the programmer, and it can either
be resolved by fiat (e.g. by lexical ordering of the tokens) or by
signaling a compile-time error.

shap

_______________________________________________
bitc-dev mailing list
bitc-dev@coyotos.org
http://www.coyotos.org/mailman/listinfo/bitc-dev

Re: [bitc-dev] FParsec and Parser Combinators more generally

Reply via email to