Re: [antlr-dev] Alternative token storage mechanisms

Jim Idle Wed, 01 Dec 2010 12:57:28 -0800

It does show how much overhead there is to such languages compared to C
though :-)


Jim

> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> On Behalf Of Terence Parr
> Sent: Wednesday, December 01, 2010 12:50 PM
> To: Sam Harwell
> Cc: Johannes Luber ([email protected]); [email protected]
> Subject: Re: [antlr-dev] Alternative token storage mechanisms
> 
> Hi Sam. Impressive.  Is this all due to no object creation overhead?
> Ter
> On Dec 1, 2010, at 8:03 AM, Sam Harwell wrote:
> 
> > Hi Dr. Parr,
> >
> > I revisited my old "slim parsing" work to again measure the
> > performance difference against Lexer/CommonToken. Currently,
> > SlimLexer/SlimToken stores only type, channel, startIndex, and
> > stopIndex, and each of these is limited to 16 bits. Originally I
> > planned to use this for syntax highlighting, where I can work within
> > those bounds. Now the basic metrics. These were tested on the
> > following 4-function calculator lexer.
> >
> > tokens {
> >         MUL='*';
> >         DIV='/';
> >         MOD='%';
> >         ADD='+';
> >         SUB='-';
> > }
> >
> > IDENTIFIER
> >         :       ('a'..'z' | 'A'..'Z' | '_')
> >                 ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
> >         ;
> >
> > NUMBER
> >         :       '0'..'9'+
> >         ;
> >
> > WS
> >         :       (' ' | '\t' | '\n' | '\r' | '\f')+
> >                 {$channel = Hidden;}
> >         ;
> >
> > Memory - CommonToken (32-bit system):
> > - 8 bytes overhead for being a class
> > - 36 bytes overhead for member variables
> >
> > Memory - CommonToken (64-bit system):
> > - 16 bytes overhead for being a class (I believe that's the object
> >   header size)
> > - 44 bytes overhead for members
> >
> > Memory - SlimToken (32- or 64-bit systems):
> > - 8 bytes total storage, and no allocations since it's a value type.
> >
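As a rough illustration of the 8-byte figure above: four 16-bit fields pack into exactly 8 bytes. The sketch below uses Python's `struct` module, not the C# value type described in the message. The names `pack_slim_token` and `unpack_slim_token` are hypothetical, and the field order is an assumption.

```python
import struct

# Four unsigned 16-bit fields, little-endian, no padding: 4 * 2 = 8 bytes.
# The field order (type, channel, startIndex, stopIndex) is assumed here.
SLIM_FORMAT = "<4H"


def pack_slim_token(token_type, channel, start_index, stop_index):
    """Pack the four fields into 8 bytes.

    struct.pack raises struct.error if any value does not fit in 16 bits,
    which mirrors the 16-bit limit mentioned in the message.
    """
    return struct.pack(SLIM_FORMAT, token_type, channel, start_index, stop_index)


def unpack_slim_token(data):
    """Return (type, channel, start_index, stop_index) from 8 packed bytes."""
    return struct.unpack(SLIM_FORMAT, data)


token = pack_slim_token(4, 0, 120, 129)
print(len(token), unpack_slim_token(token))
```

The 16-bit limit on startIndex and stopIndex is why this layout suits short inputs such as a line being syntax-highlighted, and not arbitrarily large files.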
> > Lexer speed - CommonToken:
> > - Total time: 10.34s
> > - Rate: 2.71 mil tokens/sec
> >
> > Lexer speed - SlimToken:
> > - Total time: 2.87s
> > - Rate: 9.76 mil tokens/sec
> >
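The quoted figures can be cross-checked with a little arithmetic. The Python sketch below derives the implied token count, the speed-up, and the per-token memory ratio. It assumes the two CommonToken overhead figures (object header and members) simply add up to the per-token cost, which the message does not state outright.

```python
# Figures quoted in the message above.
common_time_s, common_rate = 10.34, 2.71e6  # seconds, tokens per second
slim_time_s, slim_rate = 2.87, 9.76e6

# Both runs should have lexed roughly the same number of tokens.
common_tokens = common_time_s * common_rate
slim_tokens = slim_time_s * slim_rate

# Speed-up of SlimToken over CommonToken for the same input.
speedup = common_time_s / slim_time_s

# Assumed per-token cost for CommonToken: object header plus members.
common_bytes_32 = 8 + 36
common_bytes_64 = 16 + 44
slim_bytes = 8

print(f"tokens lexed: {common_tokens / 1e6:.1f}M vs {slim_tokens / 1e6:.1f}M")
print(f"speed-up: {speedup:.2f}x")
print(f"memory ratio (32-bit): {common_bytes_32 / slim_bytes:.1f}x")
print(f"memory ratio (64-bit): {common_bytes_64 / slim_bytes:.1f}x")
```

Both runs work out to about 28 million tokens, so the two timings are consistent with the same input, and the speed-up is roughly 3.6x.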
> > My goal is to add enough CommonToken features back to SlimToken to
> > make it usable without breaking its performance characteristics. To do
> > so, I'm working on a new revision of SlimLexer that holds a ShortToken
> > (backed by a 32-bit int) or a LongToken (backed by a 64-bit int); the
> > lexer is generic in C#. The token itself stores its type (the low 8
> > bits of ShortToken, 16 bits of LongToken), a flag for whether it's on
> > the default channel or not (+/-), and 23 or 47 bits for the token
> > index. As the lexer runs, it builds B-tree indexes for line lengths and
> > token offsets (with token lengths derived). It also holds a map from
> > Token->string so that it only has to track text when necessary. This
> > gives O(1) access to the values that drive decision making (with (value
> > & 0xFF) giving the token type for ShortToken), and O(log_b(n)) access
> > to other values. I expect to see a great improvement in performance
> > with a very practical token for real parsing tasks.
> >
> > Sam
> 
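The ShortToken layout described in the message (8 bits of type, a sign flag for the channel, 23 bits of token index) can be sketched as plain bit arithmetic. This is a minimal Python illustration, not the C# implementation. All names are hypothetical, and placing the index in the bits directly above the type is an assumption. A mask of 0xFF covers the low 8 bits.

```python
TYPE_BITS = 8
INDEX_BITS = 23
TYPE_MASK = (1 << TYPE_BITS) - 1    # 0xFF: low 8 bits hold the token type
INDEX_MASK = (1 << INDEX_BITS) - 1  # 0x7FFFFF: 23 bits for the token index
# Bit 31 is the sign bit of a 32-bit int; here it marks "not on the default channel".
CHANNEL_FLAG = 1 << (TYPE_BITS + INDEX_BITS)


def pack_short_token(token_type, index, off_default_channel=False):
    """Pack type, index, and channel flag into one 32-bit value."""
    if not 0 <= token_type <= TYPE_MASK:
        raise ValueError("token type does not fit in 8 bits")
    if not 0 <= index <= INDEX_MASK:
        raise ValueError("token index does not fit in 23 bits")
    value = (index << TYPE_BITS) | token_type
    if off_default_channel:
        value |= CHANNEL_FLAG
    return value


def get_type(value):
    return value & TYPE_MASK  # O(1): a single mask


def get_index(value):
    return (value >> TYPE_BITS) & INDEX_MASK


def is_default_channel(value):
    return (value & CHANNEL_FLAG) == 0


tok = pack_short_token(token_type=5, index=1000, off_default_channel=True)
print(get_type(tok), get_index(tok), is_default_channel(tok))
```

A LongToken would follow the same pattern with 16 type bits and 47 index bits in a 64-bit value. Offsets, lengths, and text would then be looked up through the token index, not stored in the token itself.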
> _______________________________________________
> antlr-dev mailing list
> [email protected]
> http://www.antlr.org/mailman/listinfo/antlr-dev
