If you have control of the language, I'd change it to make this easier. If you don't, it's much harder, and I'd parse it in two passes: one that handles each <!-- --> as a single token, and one that is fed the contents of each <!-- --> and parses them.
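As a rough illustration of the two-pass idea, here is a minimal sketch in plain Java rather than ANTLR. The names TwoPassCommands, Command, splitBlocks and parseBody are made up for this example, and it assumes the only commands are #rem and #break, as in the sample file quoted below. It is not a drop-in fix for the grammar, only a picture of the approach.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; none of these names come from ANTLR.
final class TwoPassCommands {

    /** Result of the second pass: a command name plus any raw text after it. */
    static final class Command {
        final String name; // "REM" or "BREAK"
        final String text; // raw remark text for REM, empty for BREAK

        Command(String name, String text) {
            this.name = name;
            this.text = text;
        }
    }

    // Pass 1: every <!-- ... --> pair is one opaque block.
    // The reluctant .*? stops at the first "-->"; DOTALL lets a block span lines.
    private static final Pattern BLOCK = Pattern.compile("<!--(.*?)-->", Pattern.DOTALL);

    static List<String> splitBlocks(String input) {
        List<String> bodies = new ArrayList<>();
        Matcher m = BLOCK.matcher(input);
        while (m.find()) {
            bodies.add(m.group(1));
        }
        return bodies;
    }

    // Pass 2: look only at the leading command word. Whatever follows #rem is
    // kept as one unstructured string and is never tokenized further.
    static Command parseBody(String body) {
        String trimmed = body.trim();
        if (trimmed.regionMatches(true, 0, "#rem", 0, 4)) {
            return new Command("REM", trimmed.substring(4).trim());
        }
        if (trimmed.equalsIgnoreCase("#break")) {
            return new Command("BREAK", "");
        }
        throw new IllegalArgumentException("Unknown command: " + trimmed);
    }

    public static void main(String[] args) {
        String input = "<!-- #rem some comment -->\n"
                + "<!-- #break -->\n"
                + "<!-- #rem some comment with $AAA &*&^, A9a 5eee and 99922 and .<><> -->\n";
        for (String body : splitBlocks(input)) {
            Command c = parseBody(body);
            System.out.println(c.name + " [" + c.text + "]");
        }
    }
}
```

The point of the split is that characters like $ and & never reach a tokenizer that has no rule for them: the first pass only cares about the delimiters, and the second pass only cares about the command word.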
That's been my plan for handling similar issues in a Wiki-like language. The only other way I know of to handle it is with a lot of error handling. The problem is that you're mixing two things inside the same delimiters: commands that are totally regular and structured, and remark text that is not. There's a reason every language I know of has an explicit comment form that is totally unstructured other than the delimiters.

HTH,
Kirby

On Mon, Apr 19, 2010 at 3:45 AM, Cor Geboers <[email protected]> wrote:
>
> Hi, I have a problem with a parser which needs to interpret a comment in a
> command language. The CL uses commands inside an HTML command pair: '<!--'
> command '-->' and I can parse most commands, except for the REM command which
> is a comment remark and should be ignored.
> I wrote a small test grammar, which shows the problem more or less:
>
> grammar Remarks;
>
> options {
>     language = Java;
> }
>
> rule: commandLine+ ;
>
> commandLine
>     : '<!--' command '-->'
>     ;
>
> command
>     : breakCommand
>     | remarkCommand
>     ;
>
> remarkCommand
>     : REM (.)*
>     ;
>
> breakCommand
>     : BREAK
>     ;
>
> WS
>     : (' ' | '\t' | '\r' | '\n')+ { $channel = HIDDEN; }
>     ;
>
> REM
>     : '#' ('R'|'r') ('E'|'e') ('M'|'m')
>     ;
>
> BREAK
>     : '#' ('B'|'b')('R'|'r')('E'|'e')('A'|'a')('K'|'k');
>
> IDENT : ('a'..'z' | 'A'..'Z')('a'..'z' | 'A'..'Z' | '0'..'9')*;
>
> A sample command file might look like this:
>
> <!-- #rem some comment -->
> <!-- #break -->
> <!-- #rem some comment with $AAA &*&^, A9a 5eee and 99922 and .<><> -->
>
> The parser recognizes the rem commands and the break command, but some
> characters are lost. It also divides the "comment" text into other tokens
> (IDENT in this case). Ideally I would like to get all characters back as one
> part, but I tried several constructs without any result.
> The last line is even parsed worse: all "special" characters like $, &, etc
> are generating warnings and not found back into the tokens.
> The errors/warnings generated are like this:
>
> line 3:28 no viable alternative at character '$'
> line 3:33 no viable alternative at character '&'
> line 3:34 no viable alternative at character '*'
> line 3:35 no viable alternative at character '&'
> line 3:36 no viable alternative at character '^'
> line 3:37 no viable alternative at character ','
> line 3:43 no viable alternative at character '5'
> line 3:52 no viable alternative at character '9'
> line 3:53 no viable alternative at character '9'
>
> How can I create the comment, so that all characters are either ignored or
> returned as one rule or token? It should do so only when inside a comment. I
> looked at other grammars for comments, like C with /* */ and see they do
> about the same.
