Re: [antlr-dev] Alternative token storage mechanisms

Jim Idle Wed, 01 Dec 2010 16:10:02 -0800

Right, but what if I make my tokens like yours, carrying no information to
speak of, and also pre-create the collections etc.?


Also, send me your tests and I will have a look as those times do not seem
correct to me.

Jim

> -----Original Message-----
> From: Sam Harwell [mailto:[email protected]]
> Sent: Wednesday, December 01, 2010 3:28 PM
> To: 'Jim Idle'
> Cc: [email protected]
> Subject: RE: [antlr-dev] Alternative token storage mechanisms
> 
> Same test on the C target release build with full optimizations for
> speed (including ANTLR3_INLINE_INPUT_UTF16):
> 
> * Overhead (32-bit): 148 bytes/token
> * Total parse time: 5.88s
> * Rate: 4.76 mil tokens/sec
> 
> The tree implementation I proposed for C# offers a significant raw
> performance (speed) boost over the C target with optimizations, while
> using fewer than 10 bytes/token. :)
> 
> I imagine you could pick up a lot by sharing the API portion of your
> tokens (a pointer to a struct with the shared function pointers).
> 
> Sam
> 
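
Sam's suggestion above (one pointer per token to a shared struct of function pointers) can be sketched roughly as follows. This is only an illustration: every name here is hypothetical, none of it is the ANTLR3 C runtime API, and it assumes that each token would otherwise carry its own copy of the function pointers.

```c
#include <stdint.h>

/* Hypothetical names throughout; this is not the ANTLR3 C runtime API. */
struct token;

/* One table of function pointers, shared by every token of the same kind. */
struct token_api {
    uint32_t (*get_type)(const struct token *t);
    uint32_t (*get_length)(const struct token *t);
};

/* Each token carries a single pointer to the shared table instead of
   its own copy of every function pointer. */
struct token {
    const struct token_api *api;
    uint32_t type;
    uint32_t start;
    uint32_t stop;   /* inclusive, assumed to follow the ANTLR convention */
};

static uint32_t tok_get_type(const struct token *t)
{
    return t->type;
}

static uint32_t tok_get_length(const struct token *t)
{
    return t->stop - t->start + 1;
}

/* The single shared instance of the table. */
static const struct token_api COMMON_API = { tok_get_type, tok_get_length };

static struct token make_token(uint32_t type, uint32_t start, uint32_t stop)
{
    struct token t = { &COMMON_API, type, start, stop };
    return t;
}
```

With this layout the per-token cost of the API is one pointer, however many functions the table holds, at the price of one extra indirection per call (`t.api->get_type(&t)`).
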
> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> On Behalf Of Jim Idle
> Sent: Wednesday, December 01, 2010 2:57 PM
> Cc: [email protected]
> Subject: Re: [antlr-dev] Alternative token storage mechanisms
> 
> It does show how much overhead there is to such languages compared to C
> though :-)
> 
> Jim
> 
> > -----Original Message-----
> > From: [email protected] [mailto:antlr-dev-[email protected]]
> > On Behalf Of Terence Parr
> > Sent: Wednesday, December 01, 2010 12:50 PM
> > To: Sam Harwell
> > Cc: Johannes Luber ([email protected]); [email protected]
> > Subject: Re: [antlr-dev] Alternative token storage mechanisms
> >
> > Hi Sam. Impressive.  Is this all due to no object creation overhead?
> > Ter
> > On Dec 1, 2010, at 8:03 AM, Sam Harwell wrote:
> >
> > > Hi Dr. Parr,
> > >
> > > > I revisited my old "slim parsing" work to again measure the
> > > > performance difference against Lexer/CommonToken. Currently,
> > > > SlimLexer/SlimToken has a limitation that it only stores type,
> > > > channel, startIndex, and stopIndex. Each of these is limited to 16
> > > > bits. Originally I planned to use this for syntax highlighting,
> > > > where I can work within those bounds. Now the basic metrics. These
> > > > were tested on the following 4-function calculator lexer.
> > >
> > > tokens {
> > >         MUL='*';
> > >         DIV='/';
> > >         MOD='%';
> > >         ADD='+';
> > >         SUB='-';
> > > }
> > >
> > > IDENTIFIER
> > >         :       ('a'..'z' | 'A'..'Z' | '_')
> > >                 ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
> > >         ;
> > >
> > > NUMBER
> > >         :       '0'..'9'+
> > >         ;
> > >
> > > WS
> > >         :       (' ' | '\t' | '\n' | '\r' | '\f')*
> > >                 {$channel = Hidden;}
> > >         ;
> > >
> > > > Memory - CommonToken (32-bit system):
> > > > * 8 bytes overhead for being a class
> > > > * 36 bytes overhead for member variables
> > > >
> > > > Memory - CommonToken (64-bit system):
> > > > * 16 bytes overhead for being a class (I believe that's the object
> > > >   header size)
> > > > * 44 bytes overhead for members
> > > >
> > > > Memory - SlimToken (32- or 64-bit systems):
> > > > * 8 bytes total storage, and no allocations since it's a value type.
> > > >
> > > > Lexer speed - CommonToken:
> > > > * Total time: 10.34s
> > > > * Rate: 2.71 mil tokens/sec
> > > >
> > > > Lexer speed - SlimToken:
> > > > * Total time: 2.87s
> > > > * Rate: 9.76 mil tokens/sec
> > >
> > > > My goal is to add enough CommonToken features back to SlimToken to
> > > > make it usable without breaking its performance characteristics. To
> > > > do so, I'm working on a new revision of SlimLexer that holds a
> > > > ShortToken (backed by a 32-bit int) or a LongToken (backed by a
> > > > 64-bit int); the lexer is generic in C#. The token itself stores its
> > > > type (low 8 bits of ShortToken, low 16 bits of LongToken), a flag for
> > > > whether it's on the default channel or not (+/-), and the token
> > > > index (23 or 47 bits). As the lexer runs, it builds B-tree indexes
> > > > for line lengths and token offsets (with token lengths derived). It
> > > > also holds a map from Token->string so that it only has to track text
> > > > when necessary. This gives O(1) access to the values that drive
> > > > decision making (with (value & 0xFF) giving the token type for
> > > > ShortToken), and O(log_b(n)) access to other values. I expect to see
> > > > a great improvement in performance with a very practical token for
> > > > real parsing tasks.
> > >
> > > Sam
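
One possible reading of the 32-bit ShortToken layout in Sam's message, sketched in C: the type in the low 8 bits, a 23-bit token index above it, and the top (sign) bit as the off-default-channel flag. All names are hypothetical, and using a plain flag bit for the "+/-" channel marker is an assumption, not a description of the actual C# code.

```c
#include <stdint.h>

typedef uint32_t short_token;            /* hypothetical name */

#define ST_TYPE_BITS    8u
#define ST_TYPE_MASK    0xFFu            /* low 8 bits: token type */
#define ST_INDEX_MASK   0x7FFFFFu        /* 23 bits: token index */
#define ST_HIDDEN_FLAG  0x80000000u      /* top bit: off the default channel */

/* Pack type, index, and channel flag into one 32-bit value. */
static short_token short_token_pack(uint32_t type, uint32_t index, int hidden)
{
    return (type & ST_TYPE_MASK)
         | ((index & ST_INDEX_MASK) << ST_TYPE_BITS)
         | (hidden ? ST_HIDDEN_FLAG : 0u);
}

/* O(1) accessors: each is a mask and, at most, one shift. */
static uint32_t short_token_type(short_token t)
{
    return t & ST_TYPE_MASK;
}

static uint32_t short_token_index(short_token t)
{
    return (t >> ST_TYPE_BITS) & ST_INDEX_MASK;
}

static int short_token_hidden(short_token t)
{
    return (t & ST_HIDDEN_FLAG) != 0;
}
```

Under these assumptions a ShortToken can distinguish 256 token types and address 2^23 (8,388,608) tokens; anything else, such as offsets, lengths, or text, would come from the side indexes keyed by the token index.
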
> >
> > _______________________________________________
> > antlr-dev mailing list
> > [email protected]
> > http://www.antlr.org/mailman/listinfo/antlr-dev
> 


