I find the idea of a SLIF-to-C utility -- a kind of Marpa-powered
yacc/bison -- exciting.  I'm sad I don't have the cycles to work on it
myself.

Your ideas for Marpa-powered lexing are almost exactly those used in the
SLIF -- right down to the handling of Unicode by memoization.  And, in
fact, the SLIF uses an array-hash arrangement, as you suggest.

To do this right, you'd want to use LATM -- Longest Acceptable Token
Matching.  LATM limits matches to those lexemes acceptable to the grammar.
The SLIF currently uses a special feature of Marpa, which I'm not sure is
documented -- ZWAs or zero-width assertions.  The SLIF's lexeme grammar
(L0) is basically

     Top ::= g1lexeme1 | g1lexeme2 | g1lexeme3

and a ZWA is put in front of each g1lexeme*.  These ZWAs are turned on
only for acceptable g1lexemes.
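
Roughly, in libmarpa terms -- and treat this as a sketch from memory,
since I'm recalling the ZWA entry points (marpa_g_zwa_new,
marpa_g_zwa_place, marpa_r_zwa_default_set) without checking marpa.h:

     #include "marpa.h"

     #define N_LEXEMES 3   /* g1lexeme1 .. g1lexeme3 from the sketch above */

     static Marpa_Assertion_ID zwa[N_LEXEMES];

     /* Place one ZWA at the front of each "Top ::= g1lexemeN" alternative. */
     void place_zwas(Marpa_Grammar l0, const Marpa_Rule_ID alt[N_LEXEMES])
     {
         for (int i = 0; i < N_LEXEMES; i++) {
             zwa[i] = marpa_g_zwa_new(l0, 0);           /* off by default */
             marpa_g_zwa_place(l0, zwa[i], alt[i], 0);  /* before the RHS */
         }
     }

     /* Before each match, turn on only the lexemes G1 finds acceptable. */
     void set_acceptable(Marpa_Recognizer r, const int ok[N_LEXEMES])
     {
         for (int i = 0; i < N_LEXEMES; i++)
             marpa_r_zwa_default_set(r, zwa[i], ok[i] ? 1 : 0);
     }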

There is another approach, one I've thought of since: prefix each lexeme*
in L0 with an L0-level pseudo-token.  Instead of turning ZWAs on and off,
you read only those L0 pseudo-tokens which come before acceptable
g1lexemes.
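
That is, the lexeme grammar becomes something like (the pseudo* names
are just placeholders):

     Top ::= pseudo1 g1lexeme1 | pseudo2 g1lexeme2 | pseudo3 g1lexeme3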

An advantage of this is that you don't have to start a new L0 recognizer
for every G1 lexeme.  The L0 grammar can be set up to read a sequence of
g1lexeme*'s, with appropriate pseudo-tokens in front of each.  Whenever you
read a g1lexeme*, you then ask G1 for the next set of acceptable lexemes,
and read the corresponding L0 pseudo-tokens into L0.
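
A sketch of that handoff, assuming a pseudo_token[] table mapping each
G1 lexeme to its L0 pseudo-token symbol, and a g1_symbol_to_lexeme()
mapping (both hypothetical):

     #include "marpa.h"

     #define MAX_SYMBOLS 1024

     int g1_symbol_to_lexeme(Marpa_Symbol_ID sym);   /* hypothetical map */

     /* After G1 reads a lexeme, ask it which lexemes it now accepts,
        and read only the matching pseudo-tokens into the L0 recognizer. */
     void prime_l0(Marpa_Recognizer g1_r, Marpa_Recognizer l0_r,
                   const Marpa_Symbol_ID pseudo_token[])
     {
         Marpa_Symbol_ID expected[MAX_SYMBOLS];
         int n = marpa_r_terminals_expected(g1_r, expected);
         for (int i = 0; i < n; i++) {
             int lex = g1_symbol_to_lexeme(expected[i]);
             if (lex >= 0)
                 marpa_r_alternative(l0_r, pseudo_token[lex], 1, 1);
         }
         marpa_r_earleme_complete(l0_r);  /* pseudo-tokens fill one earleme */
     }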

Reading several G1 lexemes in each L0 recognizer will, I think, save the
overhead of starting a new L0 recognizer for each G1 lexeme.  A
disadvantage would be its space consumption, but that can be dealt with by
starting a new L0 recognizer for every, let's say, 1000 G1 lexemes.
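
Something like this, with the threshold arbitrary and error checks
omitted:

     #include "marpa.h"

     #define L0_RECYCLE_EVERY 1000

     /* Recycle the L0 recognizer every so often to bound its space use.
        The precomputed L0 grammar itself is reused as-is. */
     Marpa_Recognizer maybe_recycle(Marpa_Grammar l0_g,
                                    Marpa_Recognizer l0_r,
                                    int *lexemes_read)
     {
         if (++*lexemes_read < L0_RECYCLE_EVERY) return l0_r;
         marpa_r_unref(l0_r);             /* drop the old recognizer  */
         l0_r = marpa_r_new(l0_g);        /* fresh one, same grammar  */
         marpa_r_start_input(l0_r);
         *lexemes_read = 0;
         return l0_r;
     }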

On Tue, Jan 27, 2015 at 1:24 PM, Andreas Kupries <[email protected]>
wrote:

> Hello all.
>
> I came very recently [1] across Marpa and given my own interest in
> parser generators [2] I dug quite a bit more into it over the last few
> days.
>
> One of the things I am interested in is a generator producing the
> parser as C code, possibly with some type of runtime to support it --
> which I found in libmarpa.
>
> As part of my search for that I came across
>
>     <http://irclog.perlgeek.de/marpa/2014-08-23>
>
> and specifically, to quote:
>
>     jeffreykegler: we'd need to use a non-Perl regex library, for example
>
> That had me a bit surprised. Below is a sketch of an idea which simply
> uses libmarpa itself for this.
>
> OTOH, I can't really believe that nobody has had this idea before, and
> given that I have only been reading through things for a few days, I
> consider it likely that I have missed it, and possibly any arguments
> against it.
>
> So, the rough idea is that, given that (lib)marpa is good and efficient
> for all LR-regular languages, it will be the same for the subset of
> plain regular languages as well, e.g. the L0 of a SLIF spec as defined
> by all the rules declared with the match (~) operator, based on
> equating tokens with characters.
>
> The Earley items are the states of the underlying (implicit) NFA, and
> the Earley sets become the states of the DFA we can derive from that.
> Final states are all the states with a completion of a (toplevel)
> lexeme.
>
> To handle the L0 we instantiate a grammar based on the
> match-definitions where the (toplevel) lexemes are the alternatives of
> the start symbol, and then a recognizer for that.
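>
> A sketch of that setup with libmarpa (error checks omitted; the match
> rules with each lexeme as LHS, built from the ~ definitions, are left
> out):
>
>     #include "marpa.h"
>
>     /* Top ::= lexeme1 | lexeme2 | ... with a completion event per
>        lexeme, so that matches surface as events (step 4 below). */
>     Marpa_Grammar make_l0(int n_lexemes, Marpa_Symbol_ID lexeme[])
>     {
>         Marpa_Config cfg;
>         marpa_c_init(&cfg);
>         Marpa_Grammar g = marpa_g_new(&cfg);
>         Marpa_Symbol_ID top = marpa_g_symbol_new(g);
>         for (int i = 0; i < n_lexemes; i++) {
>             lexeme[i] = marpa_g_symbol_new(g);
>             marpa_g_rule_new(g, top, &lexeme[i], 1);
>             marpa_g_symbol_is_completion_event_set(g, lexeme[i], 1);
>         }
>         marpa_g_start_symbol_set(g, top);
>         marpa_g_precompute(g);
>         return g;
>     }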
>
> To recognize a lexeme we proceed as follows (a C sketch follows the
> list):
>
>   1. reset the recognizer
>   2. feed characters into it (converted to tokens [3])
>   3. until the recognizer signals exhaustion or failure.
>   4. recognized lexemes are signaled to us via completion events
>   4a. we only have to remember the highest position with recognized
>       lexemes, due to LTM.
>   4b. we have to continue feeding (i.e. (2)) due to LTM as well
>   5. After (3) triggers, we check whether we have recognized one or
>      more lexemes.
>   5a. If not, it is an input failure at the lexical level.
>   5b. If so, we drop the :discard lexemes from the set and feed the
>       remainder into the G1 recognizer, after converting the lexemes
>       into the associated tokens.
>   6. Rinse and repeat.
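>
> A sketch of that loop (char_to_symbols() being the conversion table
> lookup described below; the name is hypothetical):
>
>     #include "marpa.h"
>
>     const Marpa_Symbol_ID *char_to_symbols(char c, int *n);  /* table */
>
>     /* One lexeme match: feed characters until exhaustion or failure,
>        remembering the rightmost position with a completion event. */
>     int match_lexeme(Marpa_Grammar g, Marpa_Recognizer r,
>                      const char *input, int len)
>     {
>         int last_match = -1;                       /* -1 == step 5a */
>         for (int pos = 0; pos < len; pos++) {
>             int n;
>             const Marpa_Symbol_ID *syms = char_to_symbols(input[pos], &n);
>             if (n == 0) break;                     /* step 2 failure */
>             for (int i = 0; i < n; i++)            /* all alternatives */
>                 marpa_r_alternative(r, syms[i], 1, 1);
>             if (marpa_r_earleme_complete(r) < 0) break;
>             if (marpa_g_event_count(g) > 0)        /* steps 4, 4a */
>                 last_match = pos + 1;
>             if (marpa_r_is_exhausted(r)) break;    /* step 3 */
>         }
>         return last_match;
>     }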
>
> What is missing in the above? ...
>
> Char classes ... I originally thought that these might require "Ruby
> Slippers" parsing, but that would have been complicated, and on further
> thought I decided that it would not be needed. What is critical for
> this simplification is Marpa's alternative input model, where multiple
> tokens are possible at a position.
>
> We already need a conversion table from chars to the equivalent
> symbols for the L0 recognizer. For char class support this table
> simply goes from char to a __set__ of symbols: each char class gets its
> own unique token, and all chars in the class have that token in their
> conversion entry (this can be precomputed). When feeding chars into
> the recognizer we simply feed in all tokens found in the table entry,
> for the regular character and the associated classes. The Earley-set
> calculation then figures out which rules apply and goes forward.
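>
> In C, the ASCII version of that table could look like this (the names
> are only illustrative):
>
>     #include "marpa.h"
>
>     #define MAX_TOKENS_PER_CHAR 8
>
>     /* Per-character entry: the char's own token (if the grammar
>        mentions the char) plus one token per class containing it. */
>     typedef struct {
>         int n;
>         Marpa_Symbol_ID syms[MAX_TOKENS_PER_CHAR];
>     } CharEntry;
>
>     static CharEntry table[128];     /* precomputed at init time */
>
>     /* Feed every token for one character, then close the earleme. */
>     void feed_char(Marpa_Recognizer r, unsigned char c)
>     {
>         const CharEntry *e = &table[c];
>         for (int i = 0; i < e->n; i++)
>             marpa_r_alternative(r, e->syms[i], 1, 1);
>         marpa_r_earleme_complete(r);
>     }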
>
> What else is missing ?
>
> Unicode support. I have assumed ASCII so far, and that the conversion
> table is a C array with direct lookup by character. All the data in the
> table is (and can be) pre-computed. [4]
>
> For Unicode this array can be replaced with a hash. We can still
> pre-fill it with the data for any characters directly mentioned in the
> L0 rules, be it in strings or simple char classes. For any other
> character the char-class information has to be lazily computed and
> entered into the hash when that character actually occurs in the
> input. Entering into the hash is actually optional here, a trade-off
> between speed (repeated class checks) and memory (cached class results).
>
> Using a hybrid of array and hash should be possible as well, with the
> array serving the ASCII range and the hash handling anything beyond
> that. This should gain us speed (array indexing vs hash lookup), given
> that even today most input is likely predominantly ASCII with a
> smattering of actual Unicode, at the cost of more complexity.
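>
> A sketch of the hybrid lookup, reusing CharEntry and table[] from the
> sketch above (hash_get/hash_put/compute_classes stand in for whatever
> hash and class-test code gets linked in):
>
>     #include <stdint.h>
>
>     CharEntry *hash_get(uint32_t cp);              /* hypothetical */
>     void hash_put(uint32_t cp, CharEntry *e);      /* hypothetical */
>     CharEntry *compute_classes(uint32_t cp);       /* hypothetical */
>
>     /* Direct array for the ASCII range, lazily-filled hash above it. */
>     const CharEntry *lookup(uint32_t cp)
>     {
>         if (cp < 128) return &table[cp];           /* fast path */
>         CharEntry *e = hash_get(cp);
>         if (e == NULL) {                           /* first sighting */
>             e = compute_classes(cp);               /* memoize result */
>             hash_put(cp, e);
>         }
>         return e;
>     }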
>
> Notes on the conversion table:
>
> - Characters which are not listed in the L0 grammar should have no
> 'regular' token, although they may have class tokens.
>
> - A character may have an empty set of tokens associated with it.
> Finding such a character in the input is a lexical failure, and must
> be caught in step (2).
>
> Notes on char classes:
>
> - The above way of supporting char classes allows us to support any
>   class, i.e. not just those with explicit sets and ranges of
>   characters, or inverted classes, but also all classes defined by a C
>   function. The SLIF language simply needs keywords for such implicit
>   classes, and at runtime this causes a call to the C function, be it
>   for the pre-calculation of the conversion table or during its lazy
>   extension.
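>
>   For instance, an implicit alpha class could be backed by iswalpha()
>   from <wctype.h> during the lazy table fill (a sketch; entry_add()
>   and the SYM_CLASS_* symbols are hypothetical):
>
>     #include <stdint.h>
>     #include <wctype.h>
>
>     /* Add class tokens for a codepoint by calling the C classifier
>        functions the SLIF keywords would map to. */
>     void add_class_tokens(CharEntry *e, uint32_t cp)
>     {
>         if (iswalpha(cp)) entry_add(e, SYM_CLASS_ALPHA);
>         if (iswdigit(cp)) entry_add(e, SYM_CLASS_DIGIT);
>         if (iswspace(cp)) entry_add(e, SYM_CLASS_SPACE);
>     }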
>
>
> Ok, I am done, feel free to now tell me in detail what I missed which
> makes this infeasible and/or a Bad Idea(tm), etc.
>
> ---
> [1] http://twit.tv/show/floss-weekly/321, after seeing the fossil-scm
>     episode (/320)
>
> [2]
> https://core.tcl.tk/tcllib/doc/trunk/embedded/www/tcllib/files/modules/pt/pt_introduction.html
>     (ParseTools, PEG based generator)
>
> [3] As libmarpa does not allow the user to predefine the symbol ids
>     for the tokens of a grammar, a conversion table will be necessary
>     to map from the input character to the L0 symbols before feeding
>     them to the recognizer. That is actually not too bad a thing; see
>     the notes above about char classes.
>
> [4] The token values would be placeholders for the actual values we
>     get when the grammar is initialized. IOW, I assumed that the
>     conversion table is pretty much a static array with fixed
>     data. It needs only a single init pass to rewrite the placeholders
>     to symbols. That part would require a mutex and a flag variable for
>     thread-safety. Beyond that it is read-only.
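>
>     One way to get exactly that init-once behaviour is pthread_once()
>     (sketch):
>
>         #include <pthread.h>
>
>         static pthread_once_t table_init = PTHREAD_ONCE_INIT;
>
>         /* Rewrite the placeholder ids to the real symbol ids, once;
>            after this the table is read-only and needs no locking. */
>         static void init_table(void)
>         {
>             /* ... replace placeholders with ids from the grammar ... */
>         }
>
>         void ensure_table(void)
>         {
>             pthread_once(&table_init, init_table);
>         }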
>
> --
> Andreas Kupries
> Senior Tcl Developer
> Code to Cloud: Smarter, Safer, Faster™
> F: 778.786.1133
> [email protected], http://www.activestate.com
> Learn about Stackato for Private PaaS: http://www.activestate.com/stackato
>
