[il-antlr-interest: 28309] [antlr-interest] ANTLR seems to be incorrectly generating a lexer

Andrew Haley Thu, 18 Mar 2010 12:02:24 -0700

Consider this very simple grammar to recognize strings with no embedded '"'.
ANTLR seems to be generating an incorrect lexer for StringPart.


grammar small;

defaults        
    : StringPart EOF
    ;
        
StringPart
    :    ( ~ NonStringChars) *
    ;
    
fragment
NonStringChars
    :    '"'
    ;

Look inside smallLexer.java, and you will find this:

    // $ANTLR start "StringPart"
    public final void mStringPart() throws RecognitionException {
        try {
            int _type = StringPart;
            int _channel = DEFAULT_TOKEN_CHANNEL;
            // /home/aph/ceylon/small.g:8:5: ( (~ NonStringChars )* )
            // /home/aph/ceylon/small.g:8:10: (~ NonStringChars )*
            {
            // /home/aph/ceylon/small.g:8:10: (~ NonStringChars )*
            loop1:
            do {
                int alt1=2;
                int LA1_0 = input.LA(1);

                if ( ((LA1_0>='\u0000' && LA1_0<='!')||(LA1_0>='#' && LA1_0<='\uFFFF')) ) {
                    alt1=1;
                }


                switch (alt1) {
                case 1 :
                    // /home/aph/ceylon/small.g:8:12: ~ NonStringChars
                    {

// ********************************************** Here's the bug:
                    if ( (input.LA(1)>='\u0000' && input.LA(1)<='\u0004')||(input.LA(1)>='\u0006' && input.LA(1)<='\uFFFF') ) {
                        input.consume();
// **************************************************************
                    }
                    else {
                        MismatchedSetException mse = new MismatchedSetException(null,input);

What on Earth is
 
             input.LA(1)<='\u0004')||(input.LA(1)>='\u0006'

supposed to do?  It clearly excludes control character 5, but why?  If
I change the grammar for StringPart to

StringPart
    :    ( ~ '"') *
    ;
    
I get

                    if ( (input.LA(1)>='\u0000' && input.LA(1)<='!')||(input.LA(1)>='#' && input.LA(1)<='\uFFFF') ) {
                        input.consume();

which is right, I think.  So, replacing NonStringChars with '"' in the
grammar fixes the problem.
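
To make the difference between the two generated tests concrete, here is a small standalone sketch. It is not ANTLR output: the class SetCheck and the methods viaFragment and viaLiteral are names made up for illustration, and the two conditions are simply copied from the generated code quoted above, with input.LA(1) replaced by a parameter.

```java
public class SetCheck {
    // Condition generated for ( ~ NonStringChars )* : rejects only code point 5.
    public static boolean viaFragment(int c) {
        return (c >= '\u0000' && c <= '\u0004') || (c >= '\u0006' && c <= '\uFFFF');
    }

    // Condition generated for ( ~ '"' )* : rejects only the double quote.
    public static boolean viaLiteral(int c) {
        return (c >= '\u0000' && c <= '!') || (c >= '#' && c <= '\uFFFF');
    }

    public static void main(String[] args) {
        // Probe control character 5, the double quote, and an ordinary letter.
        for (int c : new int[] {0x05, '"', 'a'}) {
            System.out.printf("U+%04X fragment=%b literal=%b%n",
                              c, viaFragment(c), viaLiteral(c));
        }
    }
}
```

The first condition accepts '"' and rejects control character 5; the second does the opposite. So with the fragment version a '"' is consumed as part of StringPart, which is exactly what the grammar was meant to prevent.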

This is all very strange.  It seems that the parser generator is
inlining NonStringChars but getting it wrong.

This is ANTLR 3.2, by the way.

Andrew.
