[il-antlr-interest: 27321] Re: [antlr-interest] MismatchedTokenException

Marcin Rzeźnicki Thu, 17 Dec 2009 08:11:52 -0800

On Wed, Dec 16, 2009 at 8:23 PM, Jim Idle <[email protected]> wrote:
> I think that the problem is you are trying to use the gated predicate to 
> continue consuming. Instead just use action code and then the gated predicate 
> will just select the rule. Here is a working example:
>
> grammar T;
>
> aaa : rule+ EOF
>   ;
>
> rule
>  : classtok
>  | ident
>  ;
>
> classtok : CLASS;
> ident : IDENTIFIER;
>
> CLASS
>  :
>  'class'
>  ;
>
>
> IDENTIFIER
>  :
>  {Character.isJavaIdentifierStart(input.LA(1))}?=> . { while 
> (Character.isJavaIdentifierPart(input.LA(1))) { input.consume(); } }
>  ;
>
>  WS : (' '|'\t'|'\n'|'\r')+ { skip(); } ;
>
> As previously stated, your rule here will cause the lexer to just barf on a 
> character that is invalid. So if you construct the set of characters that 
> cannot be anything else in your token set and use that in your while loop 
> then you will be able to check the IDENTIFIER you pick up and validate it, 
> resulting in a much nicer error message. If you can rely on the input being 
> good, then you perhaps don't need to worry about that.
>


Unfortunately this does not work. When you try to match, say,
'classification', the lexer breaks it into a CLASS token and an
IDENTIFIER 'ification'. The problem with the original example I
posted is this: judging from the tokens DFA, after successfully
matching a keyword the lexer looks one character further and tests
both whether the isJavaIdentifierStart(LA(1)) predicate holds and
whether it does not hold. In both cases it assumes that an IDENTIFIER
may start anywhere (at least that's my reading), completely ignoring
the isJavaIdentifierPart guard. It should test
isJavaIdentifierPart(LA(1)) instead, so I treat this as another bug
(sigh).

It partially works if I change the identifier rule to:

  {Character.isJavaIdentifierStart(input.LA(1))}?=>
  ({Character.isJavaIdentifierPart(input.LA(1))}?=> .)*

This is mostly fine, because every identifier start character is also
a valid identifier part, but then the lexer explodes into a myriad of
states and generation usually ends abruptly with an OutOfMemoryError,
not to mention that the result would probably not be very efficient.
That is mostly because every transition is accompanied by two
additional predicate checks (another sigh).

I am resigned. I expected problems with large grammars, but I never
suspected that I would be fighting mostly with identifier matching. I
am not sure I remember correctly, but this kind of problem was easily
solved by the 'keywords' concept in ANTLR v2. It seems that the better
is the enemy of the good once more. Thank you very much for your help,
Jim.
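
P.S. To make the 'keywords' remark concrete, here is a minimal sketch
in plain Java of the idea as I remember it: scan the longest possible
identifier first, and only then look the text up in a keyword table.
This is not ANTLR-generated code. The class name KeywordTableSketch,
the tokenize method and the one-entry keyword table are all made up
for illustration, and supplementary code points are ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not ANTLR output: match the whole
// identifier first, then decide whether it is a keyword.
final class KeywordTableSketch {
    // Assumed keyword table; the grammar in this thread has only 'class'.
    private static final Map<String, String> KEYWORDS = Map.of("class", "CLASS");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // drop whitespace, like skip() in the WS rule
                continue;
            }
            if (!Character.isJavaIdentifierStart(c)) {
                throw new IllegalArgumentException(
                    "unexpected character '" + c + "' at offset " + i);
            }
            int start = i++;
            // Consume the longest run of identifier-part characters.
            while (i < input.length()
                    && Character.isJavaIdentifierPart(input.charAt(i))) {
                i++;
            }
            String text = input.substring(start, i);
            // The keyword test only ever sees a complete lexeme.
            String type = KEYWORDS.getOrDefault(text, "IDENTIFIER");
            tokens.add(type + "(" + text + ")");
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("class classification"));
    }
}
```

With this ordering 'classification' stays a single IDENTIFIER, because
the table lookup happens after the full lexeme has been consumed
rather than racing against the keyword rule character by character.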



-- 
Greetings
Marcin Rzeźnicki

