It does show how much overhead there is to such languages compared to C though :-)
Jim > -----Original Message----- > From: [email protected] [mailto:[email protected]] > On Behalf Of Terence Parr > Sent: Wednesday, December 01, 2010 12:50 PM > To: Sam Harwell > Cc: Johannes Luber ([email protected]); [email protected] > Subject: Re: [antlr-dev] Alternative token storage mechanisms > > Hi Sam. Impressive. Is this all due to no object creation overhead? > Ter > On Dec 1, 2010, at 8:03 AM, Sam Harwell wrote: > > > Hi Dr. Parr, > > > > I revisited my old "slim parsing" work to again measure the > performance difference against Lexer/CommonToken. Currently, > SlimLexer/SlimToken has a limitation that it only stores type, channel, > startIndex, and stopIndex. Each of these is limited to 16 bits. > Originally I planned to use this for syntax highlighting, where I can > work within those bounds. Now the basic metrics. These were tested on > the following 4-function calculator lexer. > > > > tokens { > > MUL='*'; > > DIV='/'; > > MOD='%'; > > ADD='+'; > > SUB='-'; > > } > > > > IDENTIFIER > > : ('a'..'z' | 'A'..'Z' | '_') > > ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')* > > ; > > > > NUMBER > > : '0'..'9'+ > > ; > > > > WS > > : (' ' | '\t' | '\n' | '\r' | '\f')* > > {$channel = Hidden;} > > ; > > > > Memory - CommonToken (32-bit system): > > . 8 bytes overhead for being a class > > . 36 bytes overhead for member variables > > > > Memory - CommonToken (64-bit system): > > . 16 bytes overhead for being a class (I believe that's the > object header size) > > . 44 bytes overhead for members > > > > Memory - SlimToken (32- or 64-bit systems): > > . 8 bytes total storage, and no allocations since it's a > value type. > > > > Lexer speed - CommonToken: > > . Total time: 10.34s > > . Rate: 2.71 mil tokens/sec > > > > Lexer speed - SlimToken: > > . Total time 2.87s > > . Rate: 9.76 mil tokens/sec > > > > My goal is to add enough CommonToken features back to SlimToken to > make it usable without breaking its performance characteristics. To do > so, I'm working on a new revision of SlimLexer that holds a ShortToken > (backed by 32-bit int) or LongToken (backed by 64-bit int) (the lexer > is generic in C#). The token itself stores its type (low 8-bits of > ShortToken, 16-bits of LongToken), a flag of whether it's on the > default channel or not (+/-), and 23- or 47-bits for the token index). > As the lexer runs, it builds B-tree indexes for line lengths, token > offset and (with token lengths derived). It also holds a map from > Token->string so that it only has to track text when necessary. This > gives O(1) access to the values that drive decision making (with (value > & 0xF) giving the token type for ShortToken), and O(log_b(n)) access to > other values. I expect to see a great improvement in performance with a > very practical token for real parsing tasks. > > > > Sam > > _______________________________________________ > antlr-dev mailing list > [email protected] > http://www.antlr.org/mailman/listinfo/antlr-dev _______________________________________________ antlr-dev mailing list [email protected] http://www.antlr.org/mailman/listinfo/antlr-dev
