[il-antlr-interest: 28517] Re: [antlr-interest] Comments parser and non-alphanum characters

Kirby Bohling Mon, 19 Apr 2010 06:09:25 -0700

If you have control of the language, I'd change it to make it easier...

If you don't, that's much harder.  I'd parse it in two passes: one
that handles each <!-- --> block as a single token, and one that is fed
the contents of each <!-- --> block and parses them.
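
To make the two passes concrete, here is a rough sketch in plain Java.
It is only an illustration under my own assumptions: the class and
method names (TwoPassCommands, extractBodies, parseBody) are made up,
and it uses simple string scanning rather than ANTLR for both passes.
In practice the second pass could just as well be a small ANTLR grammar
that only ever sees the text between the delimiters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical two-pass sketch; all names here are illustrative, not from ANTLR.
public class TwoPassCommands {

    // Result of the second pass: a command name plus, for #rem, the raw remark text.
    public static final class Command {
        public final String name; // "REM" or "BREAK"
        public final String text; // verbatim remark for REM, empty otherwise

        Command(String name, String text) {
            this.name = name;
            this.text = text;
        }
    }

    // Pass 1: treat each <!-- ... --> block as one opaque token and return its body.
    // Assumption: text outside the blocks (e.g. newlines) is simply skipped.
    public static List<String> extractBodies(String input) {
        List<String> bodies = new ArrayList<>();
        int pos = 0;
        while (true) {
            int open = input.indexOf("<!--", pos);
            if (open < 0) {
                break;
            }
            int close = input.indexOf("-->", open + 4);
            if (close < 0) {
                throw new IllegalArgumentException("unterminated <!-- at offset " + open);
            }
            bodies.add(input.substring(open + 4, close));
            pos = close + 3;
        }
        return bodies;
    }

    // Pass 2: parse one body. Only the command keyword is structured;
    // everything after #rem is kept as a single string, whatever it contains.
    public static Command parseBody(String body) {
        String trimmed = body.trim();
        if (trimmed.regionMatches(true, 0, "#rem", 0, 4)) {
            return new Command("REM", trimmed.substring(4).trim());
        }
        if (trimmed.equalsIgnoreCase("#break")) {
            return new Command("BREAK", "");
        }
        throw new IllegalArgumentException("unknown command: " + trimmed);
    }

    // Run both passes over a whole command file.
    public static List<Command> parse(String input) {
        List<Command> commands = new ArrayList<>();
        for (String body : extractBodies(input)) {
            commands.add(parseBody(body));
        }
        return commands;
    }

    public static void main(String[] args) {
        String sample = "<!-- #rem some comment -->\n"
                + "<!--        #break -->\n"
                + "<!-- #rem some comment with $AAA &*&^, A9a 5eee and 99922 and .<><> -->\n";
        for (Command c : parse(sample)) {
            System.out.println(c.name + " [" + c.text + "]");
        }
    }
}
```

The point of the split is that the first pass never looks inside a
block, so characters like $ or & in a remark cannot produce lexer
errors.  The obvious limitation is that a remark cannot itself contain
"-->".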


That's been my plan for handling similar issues in a Wiki-like
language.  The only other way (that I know of) to handle it is with a
lot of error handling.  The problem is that you're mixing two things
inside the same area: commands that are totally regular and
structured, and remark text that is not.  There's a reason every
language I know of has an explicit comment that is totally
unstructured other than the delimiters.

HTH,
Kirby

On Mon, Apr 19, 2010 at 3:45 AM, Cor Geboers <[email protected]> wrote:
>
> Hi, I have a problem with a parser which needs to interpret a comment in a 
> command language. The CL uses commands inside an HTML command pair: '<!--' 
> command '-->' and I can parse most commands, except for the REM command which 
> is a comment remark and should be ignored.
> I wrote a small test grammar, which shows the problem more or less:
>
> grammar Remarks;
>
> options {
>  language = Java;
> }
>
> rule: commandLine+ ;
>
> commandLine
>    :   '<!--' command '-->'
>    ;
>
> command
>    :   breakCommand
>    |   remarkCommand
>    ;
>
> remarkCommand
>    :   REM (.)*
>    ;
>
> breakCommand
>    :   BREAK
>    ;
>
> WS
>    :   (' ' | '\t' | '\r' | '\n')+ { $channel = HIDDEN; }
>    ;
>
> REM
>    :   '#' ('R'|'r') ('E'|'e') ('M'|'m')
>    ;
>
> BREAK
>    :   '#' ('B'|'b')('R'|'r')('E'|'e')('A'|'a')('K'|'k');
>
> IDENT : ('a'..'z' | 'A'..'Z')('a'..'z' | 'A'..'Z' | '0'..'9')*;
>
> A sample command file might look like this:
>
> <!-- #rem some comment -->
> <!--        #break -->
> <!-- #rem some comment with $AAA &*&^, A9a 5eee and 99922 and .<><> -->
>
> The parser recognizes the rem commands and the break command, but some 
> characters are lost. It also divides the "comment" text into other tokens 
> (IDENT in this case). Ideally I would like to get all characters back as one 
> part, but I tried several constructs without any result.
> The last line is parsed even worse: all "special" characters like $, &, etc. 
> generate warnings and do not show up in the tokens. The 
> errors/warnings generated are like this:
>
> line 3:28 no viable alternative at character '$'
> line 3:33 no viable alternative at character '&'
> line 3:34 no viable alternative at character '*'
> line 3:35 no viable alternative at character '&'
> line 3:36 no viable alternative at character '^'
> line 3:37 no viable alternative at character ','
> line 3:43 no viable alternative at character '5'
> line 3:52 no viable alternative at character '9'
> line 3:53 no viable alternative at character '9'
>
> How can I create the comment, so that all characters are either ignored or 
> returned as one rule or token ? It should do so only when inside a comment. I 
> looked at other grammars for comments, like C with /* */ and see they do 
> about the same.
>
> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> Unsubscribe: 
> http://www.antlr.org/mailman/options/antlr-interest/your-email-address
>
