Re: [antlr-dev] Alternative token storage mechanisms

Terence Parr Wed, 01 Dec 2010 16:06:19 -0800

Wow. So the only disadvantage is not having a real object per token?
Ter
On Dec 1, 2010, at 3:28 PM, Sam Harwell wrote:


> Same test on the C target release build with full optimizations for speed
> (including ANTLR3_INLINE_INPUT_UTF16):
> 
> * Overhead (32-bit): 148 bytes/token
> * Total parse time: 5.88s
> * Rate: 4.76 mil tokens/sec
> 
> The tree implementation I proposed for C# offers a significant raw
> performance (speed) boost over the C target with optimizations, and uses
> less than 10 bytes/token. :)
> 
> I imagine you could pick up a lot by sharing the API portion of your tokens
> (a pointer to a struct with the shared function pointers).
> 
> Sam
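
The suggestion above about sharing the API portion of the tokens can be sketched in C roughly as follows. This is only an illustration under assumptions: the names `token_api`, `token`, `fat_token`, and `make_token` are hypothetical and do not reflect the actual ANTLR3 C runtime structures. The point is that the function pointers are stored once in a shared table, and each token pays for a single pointer to it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; not the real ANTLR3 C runtime layout. */
struct token;

/* One shared table of function pointers, stored once for all tokens. */
typedef struct token_api {
    size_t (*length)(const struct token *t);
    int    (*is_hidden)(const struct token *t);
} token_api;

/* Each token carries its data plus a single pointer to the shared table. */
typedef struct token {
    const token_api *api;
    unsigned type;
    unsigned channel;
    size_t start;
    size_t stop;   /* inclusive */
} token;

static size_t token_length(const token *t)    { return t->stop - t->start + 1; }
static int    token_is_hidden(const token *t) { return t->channel != 0; }

static const token_api shared_api = { token_length, token_is_hidden };

/* For size comparison only: the same data with the pointers embedded
   in every token instead of shared. */
typedef struct fat_token {
    size_t (*length)(const struct token *t);
    int    (*is_hidden)(const struct token *t);
    unsigned type;
    unsigned channel;
    size_t start;
    size_t stop;
} fat_token;

/* Builds a token that points at the shared table. */
static token make_token(unsigned type, unsigned channel, size_t start, size_t stop) {
    token t = { &shared_api, type, channel, start, stop };
    return t;
}
```

Calls then go through the table, for example `t.api->length(&t)`. With only two functions the saving is one pointer per token; it grows with every function pointer that would otherwise be embedded in each token.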
> 
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On
> Behalf Of Jim Idle
> Sent: Wednesday, December 01, 2010 2:57 PM
> Cc: [email protected]
> Subject: Re: [antlr-dev] Alternative token storage mechanisms
> 
> It does show how much overhead there is to such languages compared to C
> though :-)
> 
> Jim
> 
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]]
>> On Behalf Of Terence Parr
>> Sent: Wednesday, December 01, 2010 12:50 PM
>> To: Sam Harwell
>> Cc: Johannes Luber ([email protected]); [email protected]
>> Subject: Re: [antlr-dev] Alternative token storage mechanisms
>> 
>> Hi Sam. Impressive.  Is this all due to no object creation overhead?
>> Ter
>> On Dec 1, 2010, at 8:03 AM, Sam Harwell wrote:
>> 
>>> Hi Dr. Parr,
>>> 
>>> I revisited my old "slim parsing" work to again measure the performance
>>> difference against Lexer/CommonToken. Currently, SlimLexer/SlimToken has
>>> a limitation that it only stores type, channel, startIndex, and
>>> stopIndex. Each of these is limited to 16 bits. Originally I planned to
>>> use this for syntax highlighting, where I can work within those bounds.
>>> Now the basic metrics. These were tested on the following 4-function
>>> calculator lexer.
>>> 
>>> tokens {
>>>        MUL='*';
>>>        DIV='/';
>>>        MOD='%';
>>>        ADD='+';
>>>        SUB='-';
>>> }
>>> 
>>> IDENTIFIER
>>>        :       ('a'..'z' | 'A'..'Z' | '_')
>>>                ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
>>>        ;
>>> 
>>> NUMBER
>>>        :       '0'..'9'+
>>>        ;
>>> 
>>> WS
>>>        :       (' ' | '\t' | '\n' | '\r' | '\f')+
>>>                {$channel = Hidden;}
>>>        ;
>>> 
>>> Memory - CommonToken (32-bit system):
>>> * 8 bytes overhead for being a class
>>> * 36 bytes overhead for member variables
>>> 
>>> Memory - CommonToken (64-bit system):
>>> * 16 bytes overhead for being a class (I believe that's the object
>>>   header size)
>>> * 44 bytes overhead for members
>>> 
>>> Memory - SlimToken (32- or 64-bit systems):
>>> * 8 bytes total storage, and no allocations since it's a value type.
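
As a rough C analogue of the 8-byte figure above (the original SlimToken is a C# value type, so this is a sketch under assumptions, not Sam's code): four 16-bit fields pack into 8 bytes, and storing tokens by value in one growable array avoids a heap allocation per token. The field names follow the post; `slim_token_buffer` and `slim_buffer_push` are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Illustrative analogue of SlimToken: the four fields named in the post,
   each limited to 16 bits. The exact layout is an assumption. */
typedef struct slim_token {
    uint16_t type;
    uint16_t channel;
    uint16_t start_index;
    uint16_t stop_index;
} slim_token;

/* Four 16-bit fields occupy 8 bytes with no padding. */
_Static_assert(sizeof(slim_token) == 8, "slim_token should occupy 8 bytes");

/* Tokens live inline in one growable array: no per-token allocation. */
typedef struct slim_token_buffer {
    slim_token *items;
    size_t count;
    size_t capacity;
} slim_token_buffer;

/* Appends a token by value; returns 0 on success, -1 if allocation fails. */
static int slim_buffer_push(slim_token_buffer *buf, slim_token tok) {
    if (buf->count == buf->capacity) {
        size_t new_cap = buf->capacity ? buf->capacity * 2 : 16;
        slim_token *grown = realloc(buf->items, new_cap * sizeof *grown);
        if (!grown) return -1;
        buf->items = grown;
        buf->capacity = new_cap;
    }
    buf->items[buf->count++] = tok;
    return 0;
}
```

Start from a zero-initialized `slim_token_buffer`, push tokens as the lexer produces them, and `free(buf.items)` at the end.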
>>> 
>>> Lexer speed - CommonToken:
>>> * Total time: 10.34s
>>> * Rate: 2.71 mil tokens/sec
>>> 
>>> Lexer speed - SlimToken:
>>> * Total time: 2.87s
>>> * Rate: 9.76 mil tokens/sec
>>> 
>>> My goal is to add enough CommonToken features back to SlimToken to make
>>> it usable without breaking its performance characteristics. To do so,
>>> I'm working on a new revision of SlimLexer that holds a ShortToken
>>> (backed by a 32-bit int) or a LongToken (backed by a 64-bit int); the
>>> lexer is generic in C#. The token itself stores its type (the low 8 bits
>>> of ShortToken, 16 bits of LongToken), a flag for whether it's on the
>>> default channel or not (+/-), and 23 or 47 bits for the token index.
>>> As the lexer runs, it builds B-tree indexes for line lengths and token
>>> offsets (with token lengths derived). It also holds a map from
>>> Token->string so that it only has to track text when necessary. This
>>> gives O(1) access to the values that drive decision making (with
>>> (value & 0xFF) giving the token type for ShortToken), and O(log_b(n))
>>> access to other values. I expect to see a great improvement in
>>> performance with a very practical token for real parsing tasks.
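
The ShortToken packing described above might look something like this in C. The post gives only the field widths (8-bit type, a +/- channel flag, 23-bit index), so the bit positions here are an assumption, as are all the names: the flag is placed in the top (sign) bit, the index in bits 8..30, and the type in the low 8 bits so that a single mask recovers it.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout (the post gives field widths, not positions):
     bit 31      off-default-channel flag (the sign bit of a 32-bit int)
     bits 8..30  token index (23 bits)
     bits 0..7   token type */
typedef uint32_t short_token;

#define SHORT_TYPE_MASK   0xFFu
#define SHORT_INDEX_BITS  23
#define SHORT_INDEX_MASK  ((1u << SHORT_INDEX_BITS) - 1u)
#define SHORT_HIDDEN_FLAG 0x80000000u

/* Packs the three fields into one 32-bit word. */
static short_token short_token_pack(unsigned type, uint32_t index, int hidden) {
    assert(type <= SHORT_TYPE_MASK);
    assert(index <= SHORT_INDEX_MASK);
    return (hidden ? SHORT_HIDDEN_FLAG : 0u) | (index << 8) | type;
}

/* O(1) accessors for the values that drive parsing decisions. */
static unsigned short_token_type(short_token t) {
    return t & SHORT_TYPE_MASK;
}

static uint32_t short_token_index(short_token t) {
    return (t >> 8) & SHORT_INDEX_MASK;
}

static int short_token_on_default_channel(short_token t) {
    return (t & SHORT_HIDDEN_FLAG) == 0;
}
```

A LongToken would follow the same pattern on a 64-bit word, with a 16-bit type and a 47-bit index. Offsets, lengths, and text are not stored in the word at all; in the design described they come from the side indexes.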
>>> 
>>> Sam
>> 
>> _______________________________________________
>> antlr-dev mailing list
>> [email protected]
>> http://www.antlr.org/mailman/listinfo/antlr-dev
> 

