[il-antlr-interest: 32476] [antlr-interest] Parsing "comment-like" sequences of arbitrary characters

Rajesh Raman Tue, 17 May 2011 15:57:35 -0700

Hello ANTLR-ites,

I'm trying to parse an "options" structure, like the following:


options {
   foo {
      bar {
         ww: $32.50;
         xx: Jekyll & Hyde;
      }
      yy.zz: @15% p/a;
   }
}

(Please ignore the nonsensical values for ww, xx and yy.zz -- I'm just making 
a point, which will become clearer below.)  This options structure will be 
followed by a query expression whose grammar is more complicated and includes 
ints/floats, identifiers, operators, etc.

The grammar I have for parsing the options structure is shown below. (The 
grammar for the query language is complicated and therefore omitted.)

<snip>

// ... other stuff here
tokens {
   // ... other ad hoc token values
   OPTION;
   OPTION_BLOCK;
   OPTION_VALUE;
}

// ...

query_options
  : OPTIONS^ option_block
  ;

option_block
  : L_BRACE option_def* R_BRACE ->
    ^(OPTION_BLOCK option_def*)
  ;

option_def
  : option_name option_value ->
    ^(OPTION option_name option_value)
  ;

option_name
  : ID (DOT^ ID)*
  ;

option_value
  : COLON^ (~SEMICOLON)* SEMICOLON!
  | option_block
  ;

//... other stuff here
//...

OPTIONS: 'options';
ID: (LETTER | '_') (LETTER | DIGIT | '_')*;
DOT: '.';
L_BRACE: '{';
R_BRACE: '}';
COLON: ':';
SEMICOLON: ';';

SL_COMMENT: '#' ~('\r' | '\n')* NEWLINE { skip(); };
WS: (' ' | '\f' | '\r' | '\t')+ { skip(); };

...

</snip>

As mentioned, the "options" clause is part of a larger grammar for a language 
that includes operators, identifiers, numbers, etc.  However, within the 
options clause, I want the characters between the colon and the semicolon to be 
treated as a single string, even when they include characters that would 
otherwise lex into other tokens used by the language.  It feels like I 
should be able to use the same technique as is used for comment stripping (i.e., 
see the line that has COLON^...).  But this doesn't seem to work:
-  The "stray" characters that are not used elsewhere in the grammar are 
ignored and don't show up in the parse tree (e.g., $, @, % and & in the example 
above).
-  Character sequences that form valid tokens for the rest of the language 
(like integers or identifiers) are lexed into those respective tokens instead 
of being slurped into a single string as intended.

E.g., when I input a string like "options { foo: $ %     1 2 45 ^ $ $$$; }" and 
display the resulting tree.toStringTree(), I get
"(options (OPTION_BLOCK (OPTION foo (: 1 2 45))))"

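To make the intended behaviour concrete, here is a rough sketch in Python 
rather than ANTLR. It is only an illustration: the function name tokenize and 
the single OPTION_VALUE token type are assumptions made for this example, not 
part of my actual grammar. It is a hand-written scanner that switches into a 
"raw" mode after a colon and slurps everything up to the next semicolon:

```python
import re

# Hypothetical token names, loosely mirroring the grammar excerpt above.
_PUNCT = {"{": "L_BRACE", "}": "R_BRACE", ".": "DOT"}
_ID = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def tokenize(text):
    """Return (type, text) pairs; everything between ':' and ';' is one token."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1
        elif ch == "#":
            # Skip a comment through the end of the line.
            end = text.find("\n", i)
            i = len(text) if end == -1 else end + 1
        elif ch in _PUNCT:
            tokens.append((_PUNCT[ch], ch))
            i += 1
        elif ch == ":":
            # Raw mode: take every character up to the next ';' as one value.
            end = text.find(";", i + 1)
            if end == -1:
                raise ValueError("unterminated option value")
            tokens.append(("OPTION_VALUE", text[i + 1:end].strip()))
            i = end + 1
        else:
            m = _ID.match(text, i)
            if m is None:
                raise ValueError(f"unexpected character {ch!r} at offset {i}")
            word = m.group()
            tokens.append(("OPTIONS" if word == "options" else "ID", word))
            i = m.end()
    return tokens
```

For the input above, this yields a single OPTION_VALUE token with the text 
"$ %     1 2 45 ^ $ $$$" rather than the separate 1, 2 and 45 tokens. (This 
sketch assumes a value never contains a semicolon.)
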
Any guidance you have on the above will be greatly appreciated.

Thanks in advance.

++Rajesh
