Thank you *so much*. This is the behavior anyone would expect from a 
*scannerless* interface. It also happens to remove one of the three main 
motivations for my IRIF project :-)

Calling this feature “forgiving” is probably a good idea, although 
understanding what it means assumes enough familiarity with writing your 
own lexer for Marpa. I think that other names like “variable size”, 
“best length”, “informed lexing”[1], or “context aware lexing”[2] might be 
more beginner-friendly even if the feature is *implemented* as a 
forgiveness operation. But the question is who you are optimizing for. 
One could also consider that forgiving lexing is somewhat backwards 
compatible (any SLIF grammar that parsed successfully will continue to 
parse the same way with forgiving lexing). One might therefore make 
forgiveness the default and call the current behaviour “naive”[3] or 
“traditional”. But eh, names are moot as soon as this is documented.

[1]: amazingly, this awesome term has not yet been coined.
[2]: see *Context-Aware Scanning For Parsing Extensible Languages* by Van 
Wyk & Schwerdfeger, which seems to describe longest acceptable token 
matching (guessing from the abstract). The disadvantage is that the term 
risks being misheard as “context-*sensitive*”.
[3]: see that Stack Overflow question of mine…


Now I have a few questions concerning the exact semantics.

Here is how the SLIF seems to work with naive lexing:

> all lexemes → find longest → accept that, or fail 


Here is how the SLIF seems to work with context aware lexing:

> all lexemes → find longest match that is also accepted, or fail

Is this interpretation correct? 

Here is how my mind (and the IRIF and Repa) work:

> all lexemes → find those that *can* be accepted → match longest, or fail

which is desirable in a regex-based scanner that has to test all possible 
tokens sequentially, as it narrows the search space. I accordingly refer to 
this as *longest acceptable token matching*, which hints at the different 
implementation.
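To make the three pipelines concrete, here is a minimal Python sketch of 
my understanding (my own toy code, not Marpa's actual implementation; the 
`accepted` set stands in for asking the parser which lexemes it currently 
expects, and the lexeme names are just for illustration):

```python
import re

# Toy lexeme table: name -> anchored regex.
LEXEMES = {"B": re.compile(r"a+"), "C": re.compile(r"aa")}

def naive(text, pos, accepted):
    """Naive lexing: find the longest match over ALL lexemes,
    then accept it or fail."""
    name, m = max(((n, rx.match(text, pos)) for n, rx in LEXEMES.items()),
                  key=lambda nm: nm[1].end() if nm[1] else -1)
    if m and name in accepted:
        return name, m.group()
    return None  # the longest match was not acceptable: failure

def longest_acceptable(text, pos, accepted):
    """Forgiving lexing: among all matches, find the longest one
    that is also accepted, or fail."""
    cands = [(n, m) for n, rx in LEXEMES.items()
             if (m := rx.match(text, pos)) and n in accepted]
    if not cands:
        return None
    name, m = max(cands, key=lambda nm: nm[1].end())
    return name, m.group()

def filter_then_match(text, pos, accepted):
    """My variant: filter down to acceptable lexemes FIRST,
    then match the longest (narrows the search space)."""
    cands = [(n, m) for n in accepted
             if (m := LEXEMES[n].match(text, pos))]
    if not cands:
        return None
    name, m = max(cands, key=lambda nm: nm[1].end())
    return name, m.group()
```

With `accepted = {"C"}` on input `"aaaa"`, `naive` fails (B's 
four-character match wins but is not accepted), while the other two both 
return `("C", "aa")` — same result, different implementation, as intended.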

   1. In case of multiple distinct longest acceptable tokens at a certain 
   position: are all of them still recognized? Expected: yes.

   2. Given the grammar "A ::= B C | C C; B ~ 'a'+; C ~ 'aa'" and the 
   input "aaaa": (why) does this fail? Expected for all variants: failure, 
   because "B ~ 'a'+" matches the whole input, thus starving "C".
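For question 2, a tiny Python sketch (again my own toy code, not Marpa) 
of why pure longest-match starves "C" on that grammar:

```python
import re

# The lexemes from question 2: B ~ 'a'+ and C ~ 'aa'.
LEXEMES = {"B": re.compile(r"a+"), "C": re.compile(r"aa")}

def tokenize_longest(text):
    """Greedy longest-match tokenizer: at each position, take the single
    longest lexeme match and move on."""
    pos, out = 0, []
    while pos < len(text):
        name, m = max(((n, rx.match(text, pos)) for n, rx in LEXEMES.items()),
                      key=lambda nm: nm[1].end() if nm[1] else -1)
        if not m:
            raise ValueError(f"no lexeme matches at position {pos}")
        out.append((name, m.group()))
        pos = m.end()
    return out

# On "aaaa", B swallows everything: the token stream is [('B', 'aaaa')],
# so neither "B C" nor "C C" can be assembled from it.
print(tokenize_longest("aaaa"))
```
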

Thanks!

-- 
You received this message because you are subscribed to the Google Groups 
"marpa parser" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/groups/opt_out.
