Now that I'm pretty sure it will work, let me report on what I'm up to. I'm working on (and have completed phase one of) a major rewrite of Marpa's parse engine, one that will result in a new theory paper.
Before I get into the changes, let me describe, very briefly and at a very high level, the construction of Marpa's parse engine. It is my recombination of ideas, which fall into four sets:

Idea set 1.) The basic Earley parse engine (from Jay Earley).

Idea set 2.) An improvement that makes Earley's algorithm linear for right recursion (Joop Leo).

Idea set 3.) A reworking by Aycock & Horspool.

Idea set 4.) My own reworking of the algorithm, which enables you to "pause" the parse engine. You can stop the engine and resume it efficiently, at a point where you have full knowledge of what you've done so far and what is possible going forward. At that point you can actually change things in response to what you see, before resuming. (This particular property is totally new with Marpa, and is what makes many of its special features possible.)

The current reworking keeps the ideas from the other three sources, but drops the most visible idea from Aycock & Horspool: using LR(0) states to group Earley items together, in order to make Earley sets smaller. Earley parsing is table-driven, and traditionally each "Earley set" in the table had one entry per dotted rule. (A dotted rule is a rule with a location marked by a dot. The dot may be before the rule, after the rule, or between two symbols.) Aycock & Horspool noted that the same dotted rules often occur together, a fact which yacc has leveraged in the past, and that Earley's algorithm could be modified to use this, making Earley tables smaller.

Getting to the point: I am dropping the use of LR(0) states, in part going back to traditional Earley dotted rules, in part going on to a new and different idea. As I worked with Marpa, I noted that while the Aycock & Horspool LR(0)-based items did have advantages, they came with a number of big disadvantages. When it came to evaluation, the LR(0) states had to be pulled apart, back into dotted-rule form. Transitions from one LR(0) state to another were not nearly as simple as moving the dot forward in a dotted rule. Worse, they could result in duplication of dotted rules in different states, necessitating special code to deal with it. And, finally, the LR(0) states made it very difficult to take advantage of the fourth idea set: examining and altering the parse tables during "pauses". This was a major obstacle to innovation. Events, the "Ruby Slippers", and the ability to mix procedural parsing with Earley parsing are all current results of this fourth idea set. And there is more to come, but the LR(0) states would have been a continual obstacle.

States in the parser are of two types. One type is predicted, meaning you've determined that a rule could start at the current position. The other type means that you've actually recognized some symbols that could be part of the rule. I call those states "discovered".

Experience with Marpa has given me evidence on how the Aycock & Horspool states work in practice. Here are numbers from parsers for Perl, HTML, and C, giving the average number of dotted rules per LR(0) state:

                  Perl      HTML      C
    predicted     83.59     14.60     54.81
    discovered    1.5235    1.1904    1.52

LR(0) states turned out to be a huge implementation nuisance, but for predicted states they offered clear advantages. On the other hand, for practical grammars, discovered states turned out not to be worth the trouble. I will be eliminating the use of LR(0) states for *both* predicted and discovered states, but in different ways.
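To make the contrast concrete, here is a minimal sketch, in C, of a traditional Earley item: little more than a dotted rule plus an origin. The names (dotted_rule, earley_item, advance) are my own for this note, illustrative only, and not libmarpa's actual internals. Note how trivial it is to move the dot forward, the very operation that LR(0) state transitions complicate.

    /* Sketch only: a traditional Earley item. */

    typedef struct {
        int rule_id; /* which rule of the grammar */
        int dot;     /* 0 = dot before the RHS;
                        RHS length = dot after it */
    } dotted_rule;

    typedef struct {
        dotted_rule dr; /* under Aycock & Horspool, this slot would
                           instead hold an LR(0) state grouping
                           several dotted rules */
        int origin;     /* index of the Earley set where the
                           rule started */
    } earley_item;

    /* Advancing over a recognized symbol is just moving the dot. */
    static dotted_rule advance(dotted_rule dr) {
        dr.dot += 1;
        return dr;
    }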
For discovered states, I will simply go back to the traditional "one dotted rule per item" approach. Every other consideration was already against LR(0) states, so when discovered LR(0) states turned out to be of little use for practical grammars, no justification was left for keeping them.

For predicted states, I will take another approach. Predicted Earley items are quite special. In addition to a dotted rule, Earley items have to track the starting location of their rule and, in order to assist evaluation, they have to track history. Predictions are special in that they have no history, and the starting location of their rule is always the same as the current location. Finally, the dot in their dotted rules is always at the beginning of the rule. The number of rules is a constant, fixed by the grammar, and this means that the predictions in each Earley set can be tracked as a set of integers, where the integers have a very reasonable maximum. A bit vector would work well, though currently I plan an alternative representation. (For a sketch of the bit-vector idea, see the postscript below.)

OK, so where am I in all this? I'm working in phases. Phase one is changing discovered Earley items so that they always contain exactly one dotted rule -- in effect eliminating the use of LR(0) states for them. This part is coded and tested. In the next phases, I will implement my new ideas for handling predictions. I expect that these phases will allow me to introduce new Marpa features. Results visible to the user may appear as soon as the second phase.

I plan to bring out a developer's version based on Phase 1 in a few days. Phase 1 is complete and tested, but in the process of development I rearranged my test suite. This rearrangement worked fine for development, but it needs to be put back in proper order for CPANtesters.

Many thanks for your support and patience,

Jeffrey
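P.S. For the curious, here is a minimal sketch, in C, of the bit-vector representation of predictions. It is only an illustration of the idea, with names of my own choosing, and not Marpa's planned implementation, which, as I said above, will probably use an alternative representation. One bit per rule is enough, because a prediction is fully determined by its rule: its dot is always at the start, its origin is always the current location, and it has no history.

    /* Sketch only: the predictions of one Earley set as a bit
     * vector, one bit per rule of the grammar. */
    #include <stdint.h>
    #include <stdlib.h>

    #define WORD_BITS 64

    typedef struct {
        uint64_t *words; /* ceil(n_rules / 64) words */
        int n_rules;
    } prediction_set;

    static prediction_set *prediction_set_new(int n_rules) {
        prediction_set *p = malloc(sizeof *p);
        p->n_rules = n_rules;
        p->words = calloc((n_rules + WORD_BITS - 1) / WORD_BITS,
                          sizeof *p->words);
        return p;
    }

    /* Predict a rule: set its bit. */
    static void predict(prediction_set *p, int rule_id) {
        p->words[rule_id / WORD_BITS] |=
            (uint64_t)1 << (rule_id % WORD_BITS);
    }

    /* Is a rule predicted in this set? */
    static int is_predicted(const prediction_set *p, int rule_id) {
        return (p->words[rule_id / WORD_BITS]
                >> (rule_id % WORD_BITS)) & 1;
    }

One nice property of this representation: merging the predictions triggered by several different transitions is a single OR instruction per word, and the whole set costs only one word per 64 rules of the grammar.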
