On Tue, Jan 27, 2015 at 2:48 PM, Jeffrey Kegler
<[email protected]> wrote:
> I find the idea of a SLIF-to-C utility -- a kind of Marpa-powered yacc/bison
> exciting. I'm sad I don't have the cycles to work on it myself.
Don't I know that.
> Your ideas for Marpa-powered lexing are almost exactly those used in the
> SLIF -- right down to the handling of Unicode by memoization. And, in fact,
> the SLIF uses an array-hash arrangement, as you suggest.
The "difference" to my proposal then seems to be that for Marpa::R2
all this happens in either the Perl layer, or the Perl/C glue layer to
libmarpa.
> To do this right, you'd want to use LATM -- Longest Acceptable Token
> Matching. LATM limits matches to those lexemes acceptable to the grammar.
> The SLIF currently uses a special feature of Marpa, which I'm not sure is
> documented -- ZWAs or zero-width assertions.
I saw
http://jeffreykegler.github.io/Marpa-web-site/libmarpa_api/latest/api_one_page.html#Zero_002dwidth-assertion-methods
when reading the libmarpa API document. I also saw that it was under
the WIP/Untested section, and so was not really bothered that the
documentation was basically non-existent, with no real explanations.
> The SLIF's lexeme grammar (L0)
> is basically
>
> Top ::= g1lexeme1 | g1lexeme2 | g1lexeme3
>
> and a ZWA is put in front of each g1lexeme*. These ZWA's are turned on only
> for acceptable g1lexemes.
I believe I understand, even if not quite how the ZWAs play into
making this work. ... Actually, I dimly remember that "lex" has some
way of guarding lexemes, i.e. conditionally excluding them from
recognition. This seems to be the same idea.
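For my own reference, my reading of that WIP section boils down to
roughly the sketch below. Caveat: these methods are marked untested,
so the signatures may still change, and rule_top_g1lexeme1 is just a
hypothetical name for the relevant rule's ID:

    /* One assertion per g1lexeme rule, default off. */
    Marpa_Assertion_ID zwa = marpa_g_zwa_new (g_l0, 0);
    /* Guard "Top ::= g1lexeme1" at RHS position 0, i.e. just
     * before the lexeme itself. */
    marpa_g_zwa_place (g_l0, zwa, rule_top_g1lexeme1, 0);
    /* Then, per L0 recognizer, switch the guard on only when
     * g1lexeme1 is currently acceptable to G1. */
    marpa_r_zwa_default_set (r_l0, zwa, 1);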
Basically, whenever G1 needs a new token, you ask its recognizer what
set T of tokens is acceptable now, and then feed that information back
to the lexer, constraining it to return either a token in T, or
nothing (*).
I guess that in this scheme the lexemes marked as :discard are always
active/acceptable, so that we can skip them wherever they occur
between actual G1 lexemes.
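In libmarpa terms I expect the "what is acceptable now" query to be
marpa_r_terminals_expected. Roughly, as a sketch, with r_g1 being the
G1 recognizer and n_syms the number of G1 symbols:

    /* The buffer must have room for every symbol in the grammar. */
    Marpa_Symbol_ID *t = malloc (n_syms * sizeof (Marpa_Symbol_ID));
    int t_count = marpa_r_terminals_expected (r_g1, t);
    /* ... hand t[0 .. t_count-1] to the lexer as the set T ... */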
(*) And if the lexer has nothing, you can yield to the outside to let
them apply ruby slippers, should they wish to. An interesting point is
that in this scheme the lexer side will never feed a bad token into
G1; it aborts before that happens. Early detection of lexing errors,
instead of having to check the G1 recognizer after feeding it
something. Neat. Also, when preparing for the next input, finding that
T is empty then signals a full parse failure: no way forward anymore.
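Continuing the sketch above, both signals fall out naturally; the
names lexer_match and ruby_slippers_hook are made up for illustration:

    if (t_count == 0) {
        /* T is empty: nothing can move G1 forward anymore,
         * i.e. a full parse failure. */
        report_parse_failure ();                  /* hypothetical */
    } else if (!lexer_match (input, t, t_count, &token)) {
        /* No token in T matched: yield before G1 sees anything,
         * so the caller can apply ruby slippers. */
        token = ruby_slippers_hook (t, t_count);  /* hypothetical */
    }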
> There is another approach, one I've thought of since -- prefix each lexeme*
> in L0 with an L0-level pseudo-token. Instead of turning ZWA's on/off, you
> read only those L0 pseudo-tokens which come before acceptable g1lexeme*'s.
> An advantage of this is that you don't have to start a new L0 recognizer for
> every G1 lexeme.
That was actually not clear to me when writing up my idea. In
hindsight I realize that I did not see any libmarpa API method with
which a recognizer could be "reset" into an initial state, i.e.
restarted from scratch after a recognition. Having that was sort of
implied in my idea: "parse" a lexeme, throw away the accumulated parse
table, and start from scratch for the next lexeme, using the same
object.
> The L0 grammar can be set up to read a sequence of
> g1lexeme*'s, with appropriate pseudo-tokens in front of each. Whenever you
> read a g1lexeme*, you then ask G1 for the next set of acceptable lexemes,
> and read the corresponding L0 pseudo-tokens into L0.
>
> Reading several G1 lexemes in each L0 recognizer will, I think, save the
> overhead of starting a new L0 recognizer for each G1 lexeme. A disadvantage
> would be its space consumption,
I guess that the memory in question here is the accumulated parse
table of earlemes and Earley sets.
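To check my understanding of the pseudo-token reading, at each G1
lexeme boundary it would be something like the sketch below, where
pseudo_for[] is a hypothetical map from a G1 lexeme symbol to its L0
pseudo-token:

    /* Enable in L0 exactly the pseudo-tokens of the g1lexemes
     * which G1 can accept next. */
    int n = marpa_r_terminals_expected (r_g1, expected);
    for (int i = 0; i < n; i++) {
        marpa_r_alternative (r_l0, pseudo_for[expected[i]], 1, 1);
    }
    marpa_r_earleme_complete (r_l0);
    /* L0 scanning continues; only lexemes preceded by a read
     * pseudo-token can now complete. */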
> but that can be dealt with by starting a new
> L0 recognizer for every, let's say, 1000 G1 lexemes.
How complex would it be for the internals of a recognizer to release
the parse table and other allocated structures, to get back into an
initial state? It would be sort of a 'destroy' followed by a 'create',
except that the main structure would not be thrown away and
re-allocated, but re-used instead.
Re-reading the recognizer API, I am actually not feeling optimistic.
Destruction is only implicit in the "unref", and a "reset" would need
to force destruction. Likely very bad for a refcount > 1.
Your way of fully recreating the object after every X lexemes might be
the best we can do.
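I.e. something like this every N lexemes, as a sketch:

    /* Recycle the L0 recognizer to cap the memory held by its
     * accumulated Earley sets. */
    if (++lexeme_count >= 1000) {
        marpa_r_unref (r_l0);        /* drop recognizer and parse table */
        r_l0 = marpa_r_new (g_l0);   /* fresh one over the same grammar */
        marpa_r_start_input (r_l0);
        lexeme_count = 0;
    }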
--
Andreas Kupries
Senior Tcl Developer
Code to Cloud: Smarter, Safer, Faster™
F: 778.786.1133
[email protected], http://www.activestate.com
Learn about Stackato for Private PaaS: http://www.activestate.com/stackato