[il-antlr-interest: 28658] [antlr-interest] Is parser control over the lexer possible?

Chris verBurg Thu, 29 Apr 2010 16:34:15 -0700

Hey guys,

A question was posted a few days ago about dealing with an infinite input
stream, and the suggestion was to subclass TokenStream so that it didn't
read in all of the input upfront.


I'm running into a similar problem, but before I go run off and subclass
things I thought I'd see if there's a "best practice" for my situation.  It
also overlaps with the "how do I use keywords as identifiers" FAQ
<http://www.antlr.org/wiki/pages/viewpage.action?pageId=1741>.

I have a data-file grammar that recognizes strings, numbers, and a ton of
keywords.  Pretending "VERSION" and "LIMIT" are two keywords, here's (part
of) the .g file:

data_file:
  'VERSION' STRING ';'
  | 'LIMIT' NUMBER ';'
  ;

NUMBER:
  ('-'|'+')? ('0'..'9')+
  | ('-'|'+')? ('0'..'9')* '.' ('0'..'9')*
  ;

STRING:
  ('a'..'z' | 'A'..'Z' | '_' | '.' | '0'..'9')+ ;


Problem input #1:

VERSION 1.2 ;

The "1.2" is lexed as a number instead of a string, so I get a parse error.

Problem input #2:

VERSION LIMIT ;

The "LIMIT" is lexed as a keyword instead of a string, so I get a parse
error.
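
To make the two failures concrete, here is a minimal sketch in Python (not
ANTLR code) of a lexer that picks token types with no parser context.  It
assumes tokens are whitespace-separated and that literal keywords and NUMBER
take priority over STRING, which is how the grammar fragment above behaves for
these inputs; the names classify and tokenize are made up for illustration.

```python
import re

KEYWORDS = {"VERSION", "LIMIT"}
# Regex equivalents of the NUMBER and STRING lexer rules above.
NUMBER = re.compile(r"[-+]?(?:[0-9]+|[0-9]*\.[0-9]*)")
STRING = re.compile(r"[A-Za-z_.0-9]+")

def classify(word):
    """Pick a token type for one word without any parser context."""
    if word in KEYWORDS or word == ";":
        return word          # literal tokens win over STRING
    if NUMBER.fullmatch(word):
        return "NUMBER"      # NUMBER is tried before STRING
    if STRING.fullmatch(word):
        return "STRING"
    raise ValueError(f"cannot tokenize {word!r}")

def tokenize(text):
    """Assumes whitespace-separated tokens (a simplification)."""
    return [(classify(word), word) for word in text.split()]

print(tokenize("VERSION 1.2 ;"))    # the value comes out as NUMBER
print(tokenize("VERSION LIMIT ;"))  # the value comes out as the LIMIT keyword
```

In both cases the token after VERSION is never a STRING, so a rule that
demands 'VERSION' STRING ';' cannot match.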


I saw the FAQ about keywords-as-identifiers, but I don't think it's helpful
for me.  For the NUMBER-that-should-be-a-STRING problem, there's no exact
string I could pass to input.LT(1).getText().equals(), because it requires a
regex to match a NUMBER.  The other solution was to make an "identifier"
rule to match all possibilities -- is the best solution here really to
change the rule to 'VERSION' (STRING | NUMBER) ';'?

For the keyword-that-should-be-a-STRING problem, I'm hesitant to use either
of those solutions because of the sheer number of keywords in this grammar.
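
For reference, the "identifier rule" approach amounts to something like the
following Python sketch (again not ANTLR code).  The lexer stays context-free
and the parser accepts every token type that may stand in for a STRING.  The
name version_value and the whitespace-separated tokenizer are assumptions made
for illustration only.

```python
import re

KEYWORDS = {"VERSION", "LIMIT"}  # the real grammar has many more
NUMBER = re.compile(r"[-+]?(?:[0-9]+|[0-9]*\.[0-9]*)")
STRING = re.compile(r"[A-Za-z_.0-9]+")

def tokenize(text):
    """Context-free lexing of whitespace-separated words (a simplification)."""
    tokens = []
    for word in text.split():
        if word in KEYWORDS or word == ";":
            tokens.append((word, word))
        elif NUMBER.fullmatch(word):
            tokens.append(("NUMBER", word))
        elif STRING.fullmatch(word):
            tokens.append(("STRING", word))
        else:
            raise ValueError(f"cannot tokenize {word!r}")
    return tokens

# Hypothetical "version_value" rule: every token type allowed after VERSION.
VERSION_VALUE_TYPES = {"STRING", "NUMBER"} | KEYWORDS

def parse_version(tokens):
    """Parse 'VERSION' version_value ';' from an already-lexed token list."""
    if len(tokens) != 3:
        raise SyntaxError("expected exactly three tokens")
    (first, _), (kind, text), (last, _) = tokens
    if first != "VERSION" or kind not in VERSION_VALUE_TYPES or last != ";":
        raise SyntaxError(f"unexpected tokens: {tokens}")
    return text

print(parse_version(tokenize("VERSION 1.2 ;")))
print(parse_version(tokenize("VERSION LIMIT ;")))
```

The cost is visible in VERSION_VALUE_TYPES: every keyword has to be listed
there, which is exactly what gets unwieldy with a large keyword set.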


Ideally what I'd like to do is what I did in Flex and Bison (which I'm
porting this grammar from).  There, the parser controlled how the lexer
interpreted subsequent tokens.  I embedded an action in the parser rule,
immediately after the 'VERSION' token, to tell Flex to enter a
"force-the-next-token-to-be-a-STRING-no-matter-what" start state.  It worked
beautifully.  I got most of the way through implementing that in my ANTLR
grammar when I found out that all the tokens are read in before the parser
even starts up (ANTLRFileStream loads the whole file, and as far as I can
tell CommonTokenStream then buffers every token) -- which means the parser
can't give the lexer any direction over token interpretation.
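
The Flex/Bison arrangement can be sketched in Python as a lexer that produces
one token per call, plus a parser that flips a one-shot flag right after it
sees VERSION.  This is a toy model, not ANTLR or Flex API: OnDemandLexer,
force_string, expect, and parse_statement are hypothetical names, and tokens
are assumed to be whitespace-separated.

```python
import re

KEYWORDS = {"VERSION", "LIMIT"}
NUMBER = re.compile(r"[-+]?(?:[0-9]+|[0-9]*\.[0-9]*)")
STRING = re.compile(r"[A-Za-z_.0-9]+")

class OnDemandLexer:
    """Produces one token per call so the parser can steer it in between."""

    def __init__(self, text):
        self.words = iter(text.split())  # assumes whitespace-separated tokens
        self.force_string = False        # stands in for the Flex start state

    def next_token(self):
        word = next(self.words, None)
        if word is None:
            return ("EOF", "")
        # The forced mode applies to exactly one token, then resets.
        forced, self.force_string = self.force_string, False
        if forced and STRING.fullmatch(word):
            return ("STRING", word)
        if word in KEYWORDS or word == ";":
            return (word, word)
        if NUMBER.fullmatch(word):
            return ("NUMBER", word)
        if STRING.fullmatch(word):
            return ("STRING", word)
        raise ValueError(f"cannot tokenize {word!r}")

def expect(lexer, kind):
    """Pull the next token and insist on its type."""
    token = lexer.next_token()
    if token[0] != kind:
        raise SyntaxError(f"expected {kind}, got {token}")
    return token[1]

def parse_statement(lexer):
    """data_file: 'VERSION' STRING ';' | 'LIMIT' NUMBER ';'"""
    kind, _ = lexer.next_token()
    if kind == "VERSION":
        lexer.force_string = True  # parser tells the lexer what comes next
        value = expect(lexer, "STRING")
    elif kind == "LIMIT":
        value = expect(lexer, "NUMBER")
    else:
        raise SyntaxError(f"unexpected token type {kind}")
    expect(lexer, ";")
    return (kind, value)

print(parse_statement(OnDemandLexer("VERSION 1.2 ;")))
print(parse_statement(OnDemandLexer("VERSION LIMIT ;")))
```

This only works because the parser has not yet pulled the token after VERSION
when it sets the flag.  Once the token stream is buffered ahead of the parser,
the flag is set too late to matter.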


Thoughts, suggestions, outrageous flames?  Is there a "good" way to do this,
or maybe is there a completely different approach I should take?

Thanks!
-Chris
