You're making this too complicated. Parse the identifier as loosely as
possible. Many improper identifiers don't actually cause any problems in
parsing, so you can treat them as valid and report them the way you report
semantic problems, in post-AST analysis. At that stage the identifiers are
just string literal keys that reference code constructs. After you perform
semantic analysis, check each identifier (variable names, method names, etc.)
by calling the Character class methods. Log the errors, but you don't have to
stop the analysis just because of them.
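A minimal sketch of that post-analysis check, assuming the target language
follows Java's identifier rules. The class and method names (IdentifierCheck,
check) are made up for illustration; only the Character methods are real JDK
calls.

```java
import java.util.ArrayList;
import java.util.List;

class IdentifierCheck {
    /**
     * Returns one message per offending character in name.
     * An empty list means the name is a valid Java-style identifier.
     */
    static List<String> check(String name) {
        List<String> errors = new ArrayList<>();
        if (name.isEmpty()) {
            errors.add("Identifier is empty");
            return errors;
        }
        int i = 0;
        while (i < name.length()) {
            int cp = name.codePointAt(i);
            // First code point must be a valid start; the rest must be valid parts.
            boolean ok = (i == 0) ? Character.isJavaIdentifierStart(cp)
                                  : Character.isJavaIdentifierPart(cp);
            if (!ok) {
                // Log and keep going; one bad character does not stop the analysis.
                errors.add("Identifier '" + name + "' cannot "
                        + (i == 0 ? "start with" : "contain")
                        + " character '" + new String(Character.toChars(cp)) + "'");
            }
            i += Character.charCount(cp);
        }
        return errors;
    }
}
```

For example, check("foo#bar") yields a single message about '#', while
check("total") yields an empty list.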
The general rule is: don't engineer your parser to fail until you can no
longer provide useful error messages. You can always stop early manually. For
example, I sometimes throw an OperationCancelledException in an error listener
to stop a background parse for IDE IntelliSense after a user-specified number
of errors has been logged.
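Roughly, the early stop can look like the sketch below. This is a Java
illustration, not the code I actually use: LimitedErrorListener,
BackgroundParseDemo and the fake parse loop are hypothetical stand-ins, and
java.util.concurrent.CancellationException takes the place of the exception
mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CancellationException;

/** Hypothetical listener: collects errors and cancels once the limit is reached. */
class LimitedErrorListener {
    private final int maxErrors;
    private final List<String> errors = new ArrayList<>();

    LimitedErrorListener(int maxErrors) {
        this.maxErrors = maxErrors;
    }

    /** Called by the (hypothetical) parser for every error it reports. */
    void onError(String message) {
        errors.add(message);
        if (errors.size() >= maxErrors) {
            throw new CancellationException("stopped after " + errors.size() + " errors");
        }
    }

    List<String> errors() {
        return errors;
    }
}

class BackgroundParseDemo {
    /** Stand-in for a parse: returns true if it ran to completion, false if cancelled. */
    static boolean parse(String[] tokens, LimitedErrorListener listener) {
        try {
            for (String t : tokens) {
                // Pretend every token starting with '?' is a syntax error.
                if (t.startsWith("?")) {
                    listener.onError("unexpected token " + t);
                }
            }
            return true;
        } catch (CancellationException e) {
            // Parse abandoned; the errors logged so far are still in the listener.
            return false;
        }
    }
}
```

The parser itself never decides to give up; the caller sets the limit and the
listener enforces it.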
I may have missed a couple chars that are used by other language constructs
(Jim?), but this should be close:
IDENTIFIER
    : IDENTIFIER_START
      IDENTIFIER_CHAR*
    ;

fragment
IDENTIFIER_START
    : ~(OPERATOR_CHAR | LITERAL_CHAR | DIGIT | WS_CHAR)
    ;

fragment
IDENTIFIER_CHAR
    : ~(OPERATOR_CHAR | LITERAL_CHAR | WS_CHAR)
    ;

fragment
OPERATOR_CHAR
    : '+' | '-' | '~' | '!' | '*' | '/' | '%'
    | '<' | '>' | '=' | '&' | '^' | '|' | '?' | ':'
    | ';' | '\\' | '.'
    ;

fragment
LITERAL_CHAR
    : '"' | '\''
    ;
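
To make the effect of the loose rule concrete, here is a small hand-written
Java equivalent of the scan. It is only a sketch: LooseIdentifierScanner is a
made-up name, and since DIGIT and WS_CHAR are not defined above, it assumes
they mean the digits 0-9 and any whitespace character.

```java
class LooseIdentifierScanner {
    // Mirrors OPERATOR_CHAR and LITERAL_CHAR from the grammar above.
    private static final String OPERATOR_CHARS = "+-~!*/%<>=&^|?:;\\.";
    private static final String LITERAL_CHARS = "\"'";

    /** IDENTIFIER_CHAR: anything that is not an operator, quote or whitespace. */
    static boolean isIdentifierChar(char c) {
        return OPERATOR_CHARS.indexOf(c) < 0
                && LITERAL_CHARS.indexOf(c) < 0
                && !Character.isWhitespace(c); // assumption: WS_CHAR is whitespace
    }

    /** IDENTIFIER_START: an IDENTIFIER_CHAR that is not a digit. */
    static boolean isIdentifierStart(char c) {
        return isIdentifierChar(c) && !(c >= '0' && c <= '9'); // assumption: DIGIT is 0-9
    }

    /**
     * Returns the end index (exclusive) of the loose identifier that begins
     * at start, or start itself if no identifier begins there.
     */
    static int scan(String input, int start) {
        if (start >= input.length() || !isIdentifierStart(input.charAt(start))) {
            return start;
        }
        int end = start + 1;
        while (end < input.length() && isIdentifierChar(input.charAt(end))) {
            end++;
        }
        return end;
    }
}
```

With this, scan("foo#bar+1", 0) returns 7, so the whole of foo#bar becomes one
token and the stray '#' can be reported later as an identifier problem rather
than as a lexical error.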
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Marcin Rzeznicki
Sent: Thursday, December 10, 2009 10:27 AM
To: Jim Idle
Cc: [email protected]
Subject: Re: [antlr-interest] Lexer and Java keywords
On Thu, Dec 10, 2009 at 8:59 AM, Jim Idle <[email protected]> wrote:
> No - this is the wrong technique as what happens is that the lexer is simpler
> but still rejects malformed identifiers in the wrong way. You have to look
> for a valid start character, then consume until something MUST be something
> other than an identifier character. What you are looking to do is interpolate
> an identifier that has invalid characters, then issue "Identifiers cannot
> contain character 'xxxx'" etc. The trick is to not match characters that are
> identifiers but stop on characters that definitely cannot be. There is a
> subset that reduces the error margins considerably. Otherwise you throw
> lexical errors and bunches of unrelated errors.
>
I approached the problem as you suggested, using semantic predicates.
I have yet to test how it behaves when malformed input is read, but
I think this change made the parser more efficient. I transformed the
IDENTIFIER rule to:
IDENTIFIER
    : {Character.isJavaIdentifierStart(input.LA(1))}?=> .
      ( {Character.isJavaIdentifierPart(input.LA(1))}?=> . )*
    ;
--
Greetings
Marcin Rzeźnicki
List: http://www.antlr.org/mailman/listinfo/antlr-interest
Unsubscribe:
http://www.antlr.org/mailman/options/antlr-interest/your-email-address