At 22:24 19/12/2009, [email protected] wrote:
>A question to lexer rules and its priorities. Is there any
>dependency between order of lexer rule definitions?
[...]
>My understanding of lexer rules is, the best rule will
>match. The best rule is the rule matching the most
>characters. But what about TIME and IDENTIFIER_LOWER? Both
>may match the same input sequence.

Both are true. In general, the best match wins, but where two rules can match the same input, the one listed first wins.

There are also some complications in how ANTLR generates the lookahead code: it stops looking ahead as soon as it has seen enough input to eliminate all other rules, which is sometimes early enough to get it into trouble with certain kinds of input (hence the trouble with INT vs. FLOAT tokens discussed here repeatedly). I think in your case it'll be OK, but ANTLR might still get into trouble with some inputs; for example, "12h53" might be seen as a malformed TIME rather than a TIME followed by a NUMBER.

There are some problems in that grammar, though.

1. DIGIT, LOWERCASE, and UPPERCASE should almost certainly be marked as fragment rules, since you don't really want to get individual DIGIT or LOWERCASE tokens in the parser.

2. The IDENTIFIER_UPPER rule should use + instead of *. Using * means that a valid IDENTIFIER_UPPER can contain zero characters, so ANTLR can get into an infinite loop producing IDENTIFIER_UPPER tokens without consuming any input. In general, no top-level lexer rule should ever permit zero consumption.

3. You have both a NEWLINE and a WS rule matching the same characters, one skipped and one not skipped. If newlines are significant to the parser then you should remove them from the WS rule; if they're not then you should remove the NEWLINE rule, or make it a fragment.

4. Your two identifier rules specify that identifiers cannot contain digits, nor can they be mixed-case. Is this actually what you wanted?

5. In the TIME rule, you are using + in a very bizarre way. Remember, it denotes repetition, not concatenation. Are you really trying to say that "12hhhhhh25mmm" is a valid TIME?

6. You should left-factor the TIME rule, so that all of the alternatives with a common left prefix are expressed together (i.e.
have the common left prefix followed by optional alternatives). This reduces the amount of lookahead ANTLR requires, improves performance, and helps to reduce problem ambiguity cases.
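
To make the priority rule concrete, here is a rough Python sketch of "longest match wins, first listed wins on ties". It is only a model of the general rule, not of the prediction code ANTLR actually generates. The token shapes are my assumptions, since I am guessing at the grammar: TIME is taken to be hours with an optional minutes part, already left-factored as in point 6, and AM_PM is a hypothetical rule added purely to show a tie against IDENTIFIER_LOWER.

```python
import re

# Hypothetical token shapes (assumptions, not the poster's actual grammar).
# Order matters: on equal-length matches the earlier entry wins.
RULES = [
    ("TIME", re.compile(r"\d+h(\d+m)?")),       # common prefix, then optional tail
    ("AM_PM", re.compile(r"am|pm")),            # hypothetical rule, for the tie case
    ("IDENTIFIER_LOWER", re.compile(r"[a-z]+")),
    ("NUMBER", re.compile(r"\d+")),
    ("WS", re.compile(r"[ \t]+")),
]


def tokenize(text, rules=RULES):
    """Split text into (rule_name, lexeme) pairs using longest-match priority."""
    pos = 0
    tokens = []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule on a tie, and also means a
            # zero-length match can never be selected.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}: {text[pos]!r}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


print(tokenize("pm"))      # tie between AM_PM and IDENTIFIER_LOWER
print(tokenize("pmx"))     # IDENTIFIER_LOWER matches more characters
print(tokenize("12h53"))   # the input discussed above
```

Under these assumptions "pm" comes out as AM_PM (listed first), "pmx" as IDENTIFIER_LOWER (longer match), and "12h53" as a TIME "12h" followed by a NUMBER "53", because the regex engine backtracks out of the unfinished minutes part. That backtracking is exactly what ANTLR's limited lookahead may not do, which is why it might report a malformed TIME for the same input.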

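Points 2 and 5 both come down to what * and + mean, and those operators behave the same way in ordinary regular expressions, so they are easy to try out. The patterns below are my guesses at the shape of the rules in question (hypothetical reconstructions, since I am not quoting the grammar), written as Python regexes rather than ANTLR syntax.

```python
import re

# Hypothetical TIME rule with '+' after each literal, as point 5 describes.
time_with_plus = re.compile(r"\d+h+\d+m+")
# Plain sequencing: one 'h' and one 'm', which is presumably what was meant.
time_sequenced = re.compile(r"\d+h\d+m")

# Point 2: '*' allows an empty match, '+' requires at least one character.
upper_star = re.compile(r"[A-Z]*")
upper_plus = re.compile(r"[A-Z]+")


def is_valid(pattern, text):
    """True if the whole text is matched by the pattern."""
    return pattern.fullmatch(text) is not None


print(is_valid(time_with_plus, "12hhhhhh25mmm"))  # accepted by the '+' version
print(is_valid(time_sequenced, "12hhhhhh25mmm"))  # rejected by plain sequencing
print(repr(upper_star.match("abc").group()))      # empty match, nothing consumed
print(upper_plus.match("abc"))                    # no match at all
```

The empty match from the * version is the dangerous one: a lexer that accepts it emits a token without advancing, and then sees the same input again.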