Now that I'm pretty sure it will work, let me report on what I'm up to. I'm working on (and have completed phase one of) a major rewrite of Marpa's parse engine, one that will result in a new theory paper.
Before I get into the changes, let me describe, very briefly and at a very high level, the construction of Marpa's parse engine. It is my recombination of ideas, which fall into four sets:

Idea set 1.) The basic Earley parse engine (from Jay Earley).

Idea set 2.) An improvement that makes Earley's algorithm linear for right recursion (Joop Leo).

Idea set 3.) A reworking by Aycock & Horspool.

Idea set 4.) My own reworking of the algorithm, which enables you to "pause" the parse engine. You can stop the engine and resume it efficiently, at a point where you have full knowledge of what you've done so far and what is possible going forward. At that point you can actually change things in response to what you see, before resuming. (This particular property is totally new with Marpa, and is what makes many of its special features possible.)

The current reworking keeps the ideas from the other three sources, but drops the most visible idea from Aycock & Horspool: using LR(0) states to group Earley items together, in order to make Earley sets smaller. Earley parsing is table-driven, and traditionally each "Earley set" in the table had one entry per dotted rule. (A dotted rule is a rule with a location marked by a dot. The dot may be before the rule, after the rule, or between two symbols.) Aycock & Horspool noted that the same dotted rules often occur together, a fact which yacc has leveraged in the past, and that Earley's algorithm could be modified to use this, making Earley tables smaller.

Getting to the point: I am dropping the use of LR(0) states, in part going back to traditional Earley dotted rules, in part going on to a new and different idea. As I worked with Marpa, I noted that while the Aycock & Horspool LR(0)-based items did have advantages, they came with a number of big disadvantages. When it came to evaluation, the LR(0) states had to be pulled apart, back into dotted-rule form. Transitions from one LR(0) state to another were not nearly as simple as moving the dot forward in a dotted rule. Worse, they could result in duplication of dotted rules in different states, necessitating special code to deal with it. And, finally, the LR(0) states made it very difficult to take advantage of the fourth idea set: examining and altering the parse tables during "pauses". This was a major obstacle to innovation. Events, the "Ruby Slippers", and the ability to mix procedural parsing with Earley parsing are all current results of this fourth idea set. And there is more to come, but the LR(0) states would have been a continual obstacle.

States in the parser are of two types. One type is predicted, meaning you've determined that a rule could start at the current position. The other type means that you've actually recognized some symbols that could be part of the rule. I call those states "discovered".

Experience with Marpa has given me evidence on how the Aycock & Horspool states work in practice. Here are numbers from parsers for Perl, HTML, and C, giving the average number of dotted rules per LR(0) state:

                  Perl      HTML      C
    predicted     83.59     14.60     54.81
    discovered    1.5235    1.1904    1.52

LR(0) states turned out to be a huge implementation nuisance, but for predicted states they offered clear advantages. On the other hand, for practical grammars, discovered states turned out not to be worth the trouble. I will be eliminating the use of LR(0) states for *both* predicted and discovered states, but in different ways.
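To make the contrast concrete, here is a minimal sketch, in C, of a traditional Earley item: little more than a dotted rule plus an origin. The names (dotted_rule, earley_item, advance) are my own for this note, illustrative only, and not libmarpa's actual internals. Note how trivial it is to move the dot forward, the very operation that LR(0) state transitions complicate.

    /* Sketch only: a traditional Earley item. */

    typedef struct {
        int rule_id; /* which rule of the grammar */
        int dot;     /* 0 = dot before the RHS;
                        RHS length = dot after it */
    } dotted_rule;

    typedef struct {
        dotted_rule dr; /* under Aycock & Horspool, this slot would
                           instead hold an LR(0) state grouping
                           several dotted rules */
        int origin;     /* index of the Earley set where the
                           rule started */
    } earley_item;

    /* Advancing over a recognized symbol is just moving the dot. */
    static dotted_rule advance(dotted_rule dr) {
        dr.dot += 1;
        return dr;
    }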
For discovered states, I will simply go back to the traditional "one dotted rule per item" approach. Every other consideration was already against LR(0) states, so when discovered LR(0) states turned out to be of little use for practical grammars, no justification was left for keeping them.

For predicted states, I will take another approach. Predicted Earley items are quite special. In addition to a dotted rule, Earley items have to track the starting location of their rule and, in order to assist evaluation, they have to track history. Predictions are special in that they have no history, and the starting location of their rule is always the same as the current location. Finally, the dot in their dotted rules is always at the beginning of the rule. The number of rules is a constant, fixed by the grammar, and this means that the predictions in each Earley set can be tracked as a set of integers, where the integers have a very reasonable maximum. A bit vector would work well, though currently I plan an alternative representation. (For a sketch of the bit-vector idea, see the postscript below.)

OK, so where am I in all this? I'm working in phases. Phase one is changing discovered Earley items so that they always contain exactly one dotted rule -- in effect eliminating the use of LR(0) states for them. This part is coded and tested. In the next phases, I will implement my new ideas for handling predictions. I expect that these phases will allow me to introduce new Marpa features. Results visible to the user may appear as soon as the second phase.

I plan to bring out a developer's version based on Phase 1 in a few days. Phase 1 is complete and tested, but in the process of development I rearranged my test suite. This rearrangement worked fine for development, but it needs to be put back in proper order for CPANtesters.

Many thanks for your support and patience,

Jeffrey
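P.S. For the curious, here is a minimal sketch, in C, of the bit-vector representation of predictions. It is only an illustration of the idea, with names of my own choosing, and not Marpa's planned implementation, which, as I said above, will probably use an alternative representation. One bit per rule is enough, because a prediction is fully determined by its rule: its dot is always at the start, its origin is always the current location, and it has no history.

    /* Sketch only: the predictions of one Earley set as a bit
     * vector, one bit per rule of the grammar. */
    #include <stdint.h>
    #include <stdlib.h>

    #define WORD_BITS 64

    typedef struct {
        uint64_t *words; /* ceil(n_rules / 64) words */
        int n_rules;
    } prediction_set;

    static prediction_set *prediction_set_new(int n_rules) {
        prediction_set *p = malloc(sizeof *p);
        p->n_rules = n_rules;
        p->words = calloc((n_rules + WORD_BITS - 1) / WORD_BITS,
                          sizeof *p->words);
        return p;
    }

    /* Predict a rule: set its bit. */
    static void predict(prediction_set *p, int rule_id) {
        p->words[rule_id / WORD_BITS] |=
            (uint64_t)1 << (rule_id % WORD_BITS);
    }

    /* Is a rule predicted in this set? */
    static int is_predicted(const prediction_set *p, int rule_id) {
        return (p->words[rule_id / WORD_BITS]
                >> (rule_id % WORD_BITS)) & 1;
    }

One nice property of this representation: merging the predictions triggered by several different transitions is a single OR instruction per word, and the whole set costs only one word per 64 rules of the grammar.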
