[il-antlr-interest: 30342] [antlr-interest] complex lexer (at least to me)

Stanislas Rusinsky Fri, 15 Oct 2010 04:19:56 -0700

Hi list,

While writing a parser, I ran into trouble correctly lexing comments and
non-comments that look like comments.


Comments start with a '#' and end at the newline; they should be hidden.
BUT '#!something' is an ID,
and ':#header' has its own meaning too.
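
To make those three rules concrete, here is a minimal Java sketch of the classification I'm after. It is only a statement of the expected outcome, not how the ANTLR lexer does it. The class, enum, and method names are made up for illustration, and it assumes that HEADER is the literal text "header" (my real HEADER rule is not shown in this mail).

```java
// Hypothetical helper, not part of the grammar: it only states which
// token type each kind of '#' input is expected to produce.
class HashClassifier {
    enum Kind { DBT_UNIT_NAME_START, COLUMN_NAMES_END, COMMENT, OTHER }

    // Classify the text of `input` starting at index `pos`.
    // Assumption: HEADER is the literal "header".
    static Kind classify(String input, int pos) {
        if (pos < 0 || pos >= input.length() || input.charAt(pos) != '#') {
            return Kind.OTHER;                      // not a '#' construct at all
        }
        if (input.startsWith("#!", pos)) {
            return Kind.DBT_UNIT_NAME_START;        // '#!something'
        }
        if (input.startsWith("#header", pos)) {
            return Kind.COLUMN_NAMES_END;           // the '#header' in ':#header'
        }
        return Kind.COMMENT;                        // anything else up to the newline
    }
}
```

For example, classify(":#header", 1) should give COLUMN_NAMES_END, while classify("# just a note", 0) should give COMMENT.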

I've tried several approaches (syntactic predicates, ...), but none of them
worked well enough.

In this attempt, the last alternative swallows everything:

COLUMN_NAMES_END
    : HASH HEADER
      { System.out.println(" ^~^ LEXER: found HEADER_COMMENT: " + $text); }
    ;
DBT_UNIT_NAME_START
    : HASH BANG
      { System.out.println(" ^~^ LEXER: found DBT_UNIT_NAME_START: " + $text); }
    ;
LINE_COMMENT_OR_ELSE
    : ( HASH BANG ) => DBT_UNIT_NAME_START
      { $type = DBT_UNIT_NAME_START;
        System.out.println(" ^~^ LEXER: found HASH BANG: DBT_UNIT_NAME_START: " + $text); }
    | ( HASH HEADER ) => COLUMN_NAMES_END
      { $type = COLUMN_NAMES_END;
        System.out.println(" ^~^ LEXER: found HASH HEADER: COLUMN_NAMES_END: " + $text); }
    | ( HASH (options {greedy=false;} : .)* NEWLINE ) => COMMENT
      { System.out.println(" ^~^ LEXER: LINE_COMMENT Ignoring LINE comment: " + $text); }
    ;
protected
COMMENT
    : HASH (options {greedy=false;} : .)* NEWLINE
      { $channel=HIDDEN;
        System.out.println(" ^~^ LEXER: COMMENT: Ignoring LINE comment: " + $text); }
    ;

So every '#' line ends up being caught by COMMENT, and I get this single error
message when generating the grammar:
     [java] error(208): JADATextGrammar.g:98:1: The following token definitions can never be matched because prior tokens match the same input: COMMENT

Any ideas??

Stanislas Herman Rusinsky.


P.S.: From the article "What makes a language problem hard?"
( http://www.antlr.org/wiki/pages/viewpage.action?pageId=1773 ), it looks like
I am running into these:

        * Context sensitive lexer? You can't decide what vocabulary symbol to
          match unless you know what kind of sentence you are parsing.
        * Is the set of all input fixed? If you have a fixed set of files to
          convert, your job is much easier because the set of language
          construct combinations is fixed. For example, building a general
          Pascal to Java translator is much harder than building a translator
          for a set of 50 existing Pascal files.
        * Are delimiters non-fixed for things like strings and comments? That
          makes it tough to build an efficient lexer.
        * Are the source statements really similar; declarations vs
          expressions in C++?
        * Column sensitive input? E.g., are newlines significant like lines in
          a log file and does the position of an item change its meaning?
        * Does your input have comments as you do in programming languages
          that can occur anywhere in the input and need to go into the output
          in a sane location?
        * How much semantic information do you need to do the translation?
          For example, do you need to simply know that something is a type
          name or do you need to know that it is, say, an array whose indices
          are a set like (day,week,month) and contains records? Sometimes
          syntax alone is enough to do translation.
        * ...


      

