The issue was with GUID, but not b/c of the precision of the Lexer rules.   It 
might make sense to make them more general, but the real issue is that GUID and 
BAG overlap.  If I simply change the '{' on GUID to '#{' the lexer file goes 
from 10mb to 145k.   What led me to this was that the DATETIME rule is actually 
more complex than the GUID one, but it didn't affect the size much.  Because of 
the overlap, there seemed to be some multiplicative effect in the DFA.  

GUID: '{' HEX_BYTE HEX_BYTE HEX_BYTE HEX_BYTE
      '-' HEX_BYTE HEX_BYTE
      '-' HEX_BYTE HEX_BYTE
      '-' HEX_BYTE HEX_BYTE
      '-' HEX_BYTE HEX_BYTE HEX_BYTE HEX_BYTE HEX_BYTE HEX_BYTE
      '}'
;

BAG : '{' ('\\}' | ~('}'))+ '}';


-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Justin Murray
Sent: Thursday, April 21, 2011 12:27 PM
To: [email protected]
Subject: Re: [antlr-interest] Really big generated C lexer?

Hi Jim,

Would you mind elaborating on this just a bit? You bring up this same concept 
quite frequently on this list, but there is something that I am missing. I 
understand how matching more generically in earlier stages will lead to better 
error message in later stages, but this seems to me to be very inefficient. 
Taking the GUID example, you could have a very simple lexer rule:

GUID: '{' (HEX_DIGIT|'-')+ '}';

This would match a valid GUID, as well as something close (including a lot of 
common mistaken forms I'm sure). This assumes that the parsing stage will 
validate this token further in its semantics. That's all well and good, but the 
problem is now you have to implement a GUID parser within the semantic actions 
of the original parser. A GUID is a rather simple example, so I could easily 
whip up some string manipulation code to validate the format, but this seems a 
bit odd to me. The whole reason we are using ANTLR in the first place is to 
avoid writing our own recognizers from scratch, so I feel like I shouldn't need 
to write one within the semantics of the ANTLR generated one. Would this be an 
appropriate use case for island grammars? This seems like a lot of added 
complexity (in both written code and execution time), seemingly only for the 
benefit of better error messages. The beauty of Chris's original implementation 
of GUID is that it does not require any validation in sema
 ntics - the lexer will only match a completely valid GUID. As you said, the 
error message may not seem all that helpful. The example that I gave above has 
nearly the same problem though. It will allow you to put the '-' anywhere in 
the GUID, but will give just as bad of an error if a '~' character is in the 
middle of the input. So perhaps a better lexer rule would be:

GUID: '{' ~('}')+ '}'

Which will accept all sorts of screwed up syntax, but allow you to produce an 
error like "A GUID cannot contain the character '~'. A GUID should be of the 
form ...". This makes the GUID recognizer within the semantics even more 
complex, because it has to handle even more invalid possibilities.

I realize that there is probably a good balance between the two extremes, but 
it is not clear to me where that would be. I guess my question is, is this 
purely a trade-off between speed/simplicity and error handling, or am I missing 
something that would allow one to get the best of both worlds?

Thanks,

- Justin

On 4/21/2011 2:36 PM, Jim Idle wrote:
> You are just trying to do too much in the lexer really so it means you 
> get a lot of tables. Left factor and don't try to validate things in 
> the lexer. For instance you just need a very generic rule for matching 
> a GUID and then verify it semantically.
>
> Use antlr.markmail.org for getting advice on pushing error messages as 
> far down the tool chain as you can and why these kinds of things happen.
> Basically though you will get:
>
> Unexpected character 'x'
>
> vs:
>
> Line 12, offset 33: "A GUID should be of the form XXXX-XXXXXXX- ...."
>
> But, the C compiler will do a good job of dealing with the code you 
> generated.
>
> Jim
>
>> -----Original Message-----
>> From: [email protected] [mailto:antlr-interest- 
>> [email protected]] On Behalf Of Chris McConnell
>> Sent: Thursday, April 21, 2011 11:23 AM
>> To: [email protected]
>> Subject: [antlr-interest] Really big generated C lexer?
>>
>> The attached grammar generates a C lexer file of 150,000 lines.  Is 
>> this typical or did I do something dumb in the grammar?  I'd attach 
>> the C lexer file, but it is 10mb...
> List: http://www.antlr.org/mailman/listinfo/antlr-interest
> Unsubscribe: 
> http://www.antlr.org/mailman/options/antlr-interest/your-email-address

List: http://www.antlr.org/mailman/listinfo/antlr-interest
Unsubscribe: 
http://www.antlr.org/mailman/options/antlr-interest/your-email-address


List: http://www.antlr.org/mailman/listinfo/antlr-interest
Unsubscribe: 
http://www.antlr.org/mailman/options/antlr-interest/your-email-address

-- 
You received this message because you are subscribed to the Google Groups 
"il-antlr-interest" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/il-antlr-interest?hl=en.

Reply via email to