Right, but what if I make my tokens like yours, carrying no information to speak of, and also pre-create the collections, etc.?
Also, send me your tests and I will have a look as those times do not seem correct to me.

Jim

> -----Original Message-----
> From: Sam Harwell [mailto:[email protected]]
> Sent: Wednesday, December 01, 2010 3:28 PM
> To: 'Jim Idle'
> Cc: [email protected]
> Subject: RE: [antlr-dev] Alternative token storage mechanisms
>
> Same test on the C target release build with full optimizations for
> speed (including ANTLR3_INLINE_INPUT_UTF16):
>
>   * Overhead (32-bit): 148 bytes/token
>   * Total parse time: 5.88s
>   * Rate: 4.76 mil tokens/sec
>
> The tree implementation I proposed for C# offers a significant raw
> performance (speed) boost over the C target with optimizations, but
> uses less than 10 bytes/token. :)
>
> I imagine you could pick up a lot by sharing the API portion of your
> tokens (a pointer to a struct with the shared function pointers).
>
> Sam
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> On Behalf Of Jim Idle
> Sent: Wednesday, December 01, 2010 2:57 PM
> Cc: [email protected]
> Subject: Re: [antlr-dev] Alternative token storage mechanisms
>
> It does show how much overhead there is to such languages compared to C
> though :-)
>
> Jim
>
> > -----Original Message-----
> > From: [email protected] [mailto:antlr-dev-[email protected]]
> > On Behalf Of Terence Parr
> > Sent: Wednesday, December 01, 2010 12:50 PM
> > To: Sam Harwell
> > Cc: Johannes Luber ([email protected]); [email protected]
> > Subject: Re: [antlr-dev] Alternative token storage mechanisms
> >
> > Hi Sam. Impressive. Is this all due to no object creation overhead?
> > Ter
> >
> > On Dec 1, 2010, at 8:03 AM, Sam Harwell wrote:
> >
> > > Hi Dr. Parr,
> > >
> > > I revisited my old "slim parsing" work to again measure the
> > > performance difference against Lexer/CommonToken. Currently,
> > > SlimLexer/SlimToken has a limitation that it only stores type,
> > > channel, startIndex, and stopIndex. Each of these is limited to
> > > 16 bits. Originally I planned to use this for syntax highlighting,
> > > where I can work within those bounds. Now the basic metrics. These
> > > were tested on the following 4-function calculator lexer.
> > >
> > > tokens {
> > >     MUL='*';
> > >     DIV='/';
> > >     MOD='%';
> > >     ADD='+';
> > >     SUB='-';
> > > }
> > >
> > > IDENTIFIER
> > >     : ('a'..'z' | 'A'..'Z' | '_')
> > >       ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
> > >     ;
> > >
> > > NUMBER
> > >     : '0'..'9'+
> > >     ;
> > >
> > > WS
> > >     : (' ' | '\t' | '\n' | '\r' | '\f')*
> > >       {$channel = Hidden;}
> > >     ;
> > >
> > > Memory - CommonToken (32-bit system):
> > >   * 8 bytes overhead for being a class
> > >   * 36 bytes overhead for member variables
> > >
> > > Memory - CommonToken (64-bit system):
> > >   * 16 bytes overhead for being a class (I believe that's the
> > >     object header size)
> > >   * 44 bytes overhead for members
> > >
> > > Memory - SlimToken (32- or 64-bit systems):
> > >   * 8 bytes total storage, and no allocations since it's a
> > >     value type.
> > >
> > > Lexer speed - CommonToken:
> > >   * Total time: 10.34s
> > >   * Rate: 2.71 mil tokens/sec
> > >
> > > Lexer speed - SlimToken:
> > >   * Total time 2.87s
> > >   * Rate: 9.76 mil tokens/sec
> > >
> > > My goal is to add enough CommonToken features back to SlimToken to
> > > make it usable without breaking its performance characteristics.
> > > To do so, I'm working on a new revision of SlimLexer that holds a
> > > ShortToken (backed by 32-bit int) or LongToken (backed by 64-bit
> > > int) (the lexer is generic in C#). The token itself stores its
> > > type (low 8-bits of ShortToken, 16-bits of LongToken), a flag of
> > > whether it's on the default channel or not (+/-), and 23- or
> > > 47-bits for the token index). As the lexer runs, it builds B-tree
> > > indexes for line lengths, token offset and (with token lengths
> > > derived). It also holds a map from Token->string so that it only
> > > has to track text when necessary. This gives O(1) access to the
> > > values that drive decision making (with (value & 0xF) giving the
> > > token type for ShortToken), and O(log_b(n)) access to other
> > > values. I expect to see a great improvement in performance with a
> > > very practical token for real parsing tasks.
> > >
> > > Sam
> >
> > _______________________________________________
> > antlr-dev mailing list
> > [email protected]
> > http://www.antlr.org/mailman/listinfo/antlr-dev
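
For anyone who wants to picture the ShortToken layout described in the quoted message, here is a minimal sketch in C. It is not Sam's code. The names (short_token, st_pack and friends) are made up for illustration, and the exact bit positions are an assumption: type in the low 8 bits, a 23-bit token index above it, and the top (sign) bit set when the token is off the default channel. The type mask is 0xFF on the assumption that the type really occupies the low 8 bits as the message states, although the message itself writes (value & 0xF).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout, assumed from the description in the thread:
 *   bits 0..7   token type
 *   bits 8..30  token index (23 bits)
 *   bit  31     set when the token is NOT on the default channel
 * Stored unsigned so the shifts are well defined; read as a signed
 * 32-bit int, a set top bit makes the value negative ("+/-").
 */
typedef uint32_t short_token;

#define ST_TYPE_MASK   0xFFu
#define ST_INDEX_BITS  23
#define ST_INDEX_MASK  ((1u << ST_INDEX_BITS) - 1u)
#define ST_HIDDEN_BIT  (1u << 31)

/* Pack type, index and channel flag into one 32-bit value. */
static short_token st_pack(unsigned type, uint32_t index, int hidden)
{
    assert(type <= ST_TYPE_MASK);    /* type must fit in 8 bits   */
    assert(index <= ST_INDEX_MASK);  /* index must fit in 23 bits */
    short_token t = type | (index << 8);
    return hidden ? (t | ST_HIDDEN_BIT) : t;
}

/* Each accessor is a mask and at most one shift, i.e. O(1). */
static unsigned st_type(short_token t)   { return t & ST_TYPE_MASK; }
static uint32_t st_index(short_token t)  { return (t >> 8) & ST_INDEX_MASK; }
static int      st_hidden(short_token t) { return (t & ST_HIDDEN_BIT) != 0; }
```

With this layout, st_pack(5, 123456, 1) yields a 4-byte value from which st_type, st_index and st_hidden recover 5, 123456 and 1. Everything else (text, line, offsets) would have to come from side tables such as the indexes the message describes, which this sketch does not attempt.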
