Re: [antlr-dev] Alternative token storage mechanisms

Sam Harwell Wed, 01 Dec 2010 15:28:36 -0800

Same test on the C target release build with full optimizations for speed
(including ANTLR3_INLINE_INPUT_UTF16):


* Overhead (32-bit): 148 bytes/token
* Total parse time: 5.88s
* Rate: 4.76 mil tokens/sec

The tree implementation I proposed for C# offers a significant raw
performance (speed) boost over the C target with optimizations, while using
less than 10 bytes/token. :)

I imagine you could pick up a lot by sharing the API portion of your tokens
(a pointer to a struct with the shared function pointers).
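
To make that concrete, here is a minimal sketch of what I mean. Every name in
it (token_api, token, token_make, and so on) is made up for illustration and
is not the C runtime's actual layout; the point is only that each token
carries one pointer to a single shared table instead of its own copy of every
function pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; NOT the ANTLR C runtime's real structs. */
struct token;

/* One table of "methods", allocated once and shared by every token. */
struct token_api {
    uint32_t (*get_type)(const struct token *t);
    size_t   (*get_length)(const struct token *t);
};

struct token {
    const struct token_api *api; /* one pointer instead of N function pointers */
    uint32_t type;
    uint32_t channel;
    size_t   start;              /* assumed inclusive */
    size_t   stop;               /* assumed inclusive */
};

static uint32_t token_get_type(const struct token *t) { return t->type; }

static size_t token_get_length(const struct token *t) {
    return t->stop - t->start + 1;
}

/* The single shared instance of the API table. */
static const struct token_api TOKEN_API = { token_get_type, token_get_length };

static struct token token_make(uint32_t type, size_t start, size_t stop) {
    struct token t = { &TOKEN_API, type, 0, start, stop };
    return t;
}
```

Callers still go through the table (tok.api->get_type(&tok)), so the calling
convention stays pointer-based, but adding a function to the API grows the
table once rather than growing every token.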

Sam

-----Original Message-----
From: [email protected] [mailto:[email protected]] On
Behalf Of Jim Idle
Sent: Wednesday, December 01, 2010 2:57 PM
Cc: [email protected]
Subject: Re: [antlr-dev] Alternative token storage mechanisms

It does show how much overhead there is to such languages compared to C
though :-)

Jim

> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> On Behalf Of Terence Parr
> Sent: Wednesday, December 01, 2010 12:50 PM
> To: Sam Harwell
> Cc: Johannes Luber ([email protected]); [email protected]
> Subject: Re: [antlr-dev] Alternative token storage mechanisms
> 
> Hi Sam. Impressive.  Is this all due to no object creation overhead?
> Ter
> On Dec 1, 2010, at 8:03 AM, Sam Harwell wrote:
> 
> > Hi Dr. Parr,
> >
> > I revisited my old "slim parsing" work to again measure the performance
> > difference against Lexer/CommonToken. Currently, SlimLexer/SlimToken has
> > a limitation that it only stores type, channel, startIndex, and
> > stopIndex. Each of these is limited to 16 bits. Originally I planned to
> > use this for syntax highlighting, where I can work within those bounds.
> > Now the basic metrics. These were tested on the following 4-function
> > calculator lexer.
> >
> > tokens {
> >         MUL='*';
> >         DIV='/';
> >         MOD='%';
> >         ADD='+';
> >         SUB='-';
> > }
> >
> > IDENTIFIER
> >         :       ('a'..'z' | 'A'..'Z' | '_')
> >                 ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
> >         ;
> >
> > NUMBER
> >         :       '0'..'9'+
> >         ;
> >
> > WS
> >         :       (' ' | '\t' | '\n' | '\r' | '\f')*
> >                 {$channel = Hidden;}
> >         ;
> >
> > Memory - CommonToken (32-bit system):
> > * 8 bytes overhead for being a class
> > * 36 bytes overhead for member variables
> >
> > Memory - CommonToken (64-bit system):
> > * 16 bytes overhead for being a class (I believe that's the object
> >   header size)
> > * 44 bytes overhead for members
> >
> > Memory - SlimToken (32- or 64-bit systems):
> > * 8 bytes total storage, and no allocations since it's a value type.
> >
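
For readers following along in C, the 8-byte figure falls straight out of
four 16-bit fields. The sketch below is a hypothetical C rendering, not the
actual C# SlimToken; the names slim_token and slim_token_try_make are
assumptions for illustration. It also shows the cost of the layout: any
value that does not fit in 16 bits has to be rejected.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C equivalent of the four 16-bit fields described above. */
typedef struct {
    uint16_t type;
    uint16_t channel;
    uint16_t start_index;
    uint16_t stop_index;
} slim_token;

/* Four 2-byte fields with 2-byte alignment leave no padding. */
_Static_assert(sizeof(slim_token) == 8, "slim_token should occupy 8 bytes");

/* Returns 1 and fills *out if every value fits in 16 bits, else returns 0. */
static int slim_token_try_make(long type, long channel, long start, long stop,
                               slim_token *out) {
    if (type < 0 || type > UINT16_MAX) return 0;
    if (channel < 0 || channel > UINT16_MAX) return 0;
    if (start < 0 || start > UINT16_MAX) return 0;
    if (stop < 0 || stop > UINT16_MAX) return 0;
    out->type = (uint16_t)type;
    out->channel = (uint16_t)channel;
    out->start_index = (uint16_t)start;
    out->stop_index = (uint16_t)stop;
    return 1;
}
```

Because the struct is passed and stored by value, an array of these needs no
per-token allocation, which is the same property the value type gives in C#.
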
> > Lexer speed - CommonToken:
> > * Total time: 10.34s
> > * Rate: 2.71 mil tokens/sec
> >
> > Lexer speed - SlimToken:
> > * Total time: 2.87s
> > * Rate: 9.76 mil tokens/sec
> > My goal is to add enough CommonToken features back to SlimToken to make
> > it usable without breaking its performance characteristics. To do so,
> > I'm working on a new revision of SlimLexer that holds a ShortToken
> > (backed by a 32-bit int) or a LongToken (backed by a 64-bit int); the
> > lexer is generic in C#. The token itself stores its type (the low 8 bits
> > of ShortToken, 16 bits of LongToken), a flag for whether it's on the
> > default channel or not (+/-), and 23 or 47 bits for the token index.
> > As the lexer runs, it builds B-tree indexes for line lengths and token
> > offsets (with token lengths derived). It also holds a map from
> > Token->string so that it only has to track text when necessary. This
> > gives O(1) access to the values that drive decision making (with
> > (value & 0xFF) giving the token type for ShortToken), and O(log_b(n))
> > access to other values. I expect to see a great improvement in
> > performance with a very practical token for real parsing tasks.
> >
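
A rough C sketch of the 32-bit ShortToken packing described above, under two
assumptions that the description does not spell out: the "+/-" flag is taken
to be the top (sign) bit, and the 23-bit index sits directly above the 8-bit
type. The names short_token, short_token_pack, and the ST_* macros are
hypothetical. An unsigned backing type is used here so the shifts are well
defined in C.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t short_token;

#define ST_TYPE_BITS   8u
#define ST_INDEX_BITS  23u
#define ST_TYPE_MASK   0xFFu                          /* low 8 bits */
#define ST_INDEX_MASK  ((1u << ST_INDEX_BITS) - 1u)   /* 0x7FFFFF */
#define ST_HIDDEN_FLAG 0x80000000u                    /* assumed: sign bit */

/* Layout (assumed): [31] off-default-channel flag | [30..8] index | [7..0] type */
static short_token short_token_pack(uint32_t type, uint32_t index, int hidden) {
    return (type & ST_TYPE_MASK)
         | ((index & ST_INDEX_MASK) << ST_TYPE_BITS)
         | (hidden ? ST_HIDDEN_FLAG : 0u);
}

/* O(1) accessors: a mask, or a shift and a mask. */
static uint32_t short_token_type(short_token t) {
    return t & ST_TYPE_MASK;
}

static uint32_t short_token_index(short_token t) {
    return (t >> ST_TYPE_BITS) & ST_INDEX_MASK;
}

static int short_token_on_default_channel(short_token t) {
    return (t & ST_HIDDEN_FLAG) == 0;
}
```

The 8 + 23 + 1 bits account for all 32; the LongToken variant would be the
same idea with 16 + 47 + 1 bits in a 64-bit integer. Offsets, lengths, and
text are deliberately absent here, since those are what the side indexes and
the Token->string map are for.
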
> > Sam
> 
> _______________________________________________
> antlr-dev mailing list
> [email protected]
> http://www.antlr.org/mailman/listinfo/antlr-dev
