After looking this over some more, I think I've figured out a way to bring this to the Java runtime. Without user-defined value types, a basic implementation in Java would require tokens to be a [completely type-unsafe] int. However, my implementation focuses on the interaction between FastLexer and FastTokenStream. You could use int[] for the token caches within these classes and, when necessary, expose a FastToken that wraps the int in an IToken implementation. In my implementation, FastToken is just an int (inside a struct), while FastTokenWrapper&lt;FastToken&gt; is the actual IToken implementation: it holds the FastToken plus a reference to the FastLexer needed to resolve all of the IToken information.
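A minimal Java sketch of the idea above — an int[]-backed token cache with an on-demand wrapper. The IToken interface here is a simplified stand-in for ANTLR's (only two members shown), and everything except the FastTokenStream/IToken names mentioned above is hypothetical; a real port would resolve line, text, etc. from the lexer state:

```java
// Sketch: pack token fields into primitive ints, wrap in an object only on demand.
public class FastTokenDemo {
    // Simplified stand-in for ANTLR's IToken; the real interface has more members.
    interface IToken {
        int getType();
        int getChannel();
    }

    // Token cache backed by a primitive int array: two ints per token (type, channel).
    static class FastTokenStream {
        private int[] data = new int[16];
        private int count;

        void add(int type, int channel) {
            if (count * 2 == data.length) {
                int[] grown = new int[data.length * 2];
                System.arraycopy(data, 0, grown, 0, data.length);
                data = grown;
            }
            data[count * 2] = type;
            data[count * 2 + 1] = channel;
            count++;
        }

        // Allocate an IToken wrapper only when a caller actually needs one;
        // the buffer itself stays an array of lightweight primitives.
        IToken get(final int index) {
            return new IToken() {
                public int getType() { return data[index * 2]; }
                public int getChannel() { return data[index * 2 + 1]; }
            };
        }
    }

    public static void main(String[] args) {
        FastTokenStream tokens = new FastTokenStream();
        tokens.add(5, 0);   // e.g. IDENTIFIER on the default channel
        tokens.add(7, 99);  // e.g. WS on a hidden channel
        System.out.println(tokens.get(0).getType());
        System.out.println(tokens.get(1).getChannel());
    }
}
```

The trade-off is exactly the one described above: the wrapper reintroduces an allocation per accessed token, but the cache itself keeps the primitive-array memory footprint.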
You'd lose the elimination of object-creation overhead that accounted for much of the speed gain, but you'd keep the full memory benefit of storing the buffer as an array of lightweight primitives (a 5:1 memory savings in a compiler is non-trivial). I have a lot of polishing work left before I can check this in, but I like the promise it shows. I look forward to incorporating it into the ports of Tool and StringTemplate and into my UnrealScript project for real-world performance testing.

Sam

-----Original Message-----
From: Sam Harwell [mailto:[email protected]]
Sent: Wednesday, December 01, 2010 10:26 PM
To: 'Jim Idle'
Cc: '[email protected]'
Subject: RE: [antlr-dev] Alternative token storage mechanisms

I attached a copy of the grammar and test host for C. It's compiled against the latest C runtime in P4. All test specs and the grammar are identical to the C# version.

I put together an initial implementation of the new storage. I haven't run it through the profiler yet, but I did make sure it passes this set of tests against the behavior of CommonTokenStream/CommonToken:

    // check
    for (int i = 0; i < tokens.Count; i++)
    {
        IToken a = tokens.Get(i);
        IToken b = ((ITokenStream)fastTokens).Get(i);
        Assert.AreEqual(a.Channel, b.Channel);
        Assert.AreEqual(a.CharPositionInLine, b.CharPositionInLine);
        Assert.AreEqual(a.Line, b.Line);
        Assert.AreEqual(a.StartIndex, b.StartIndex);
        Assert.AreEqual(a.StopIndex, b.StopIndex);
        Assert.AreEqual(a.Text, b.Text);
        Assert.AreEqual(a.TokenIndex, b.TokenIndex);
        Assert.AreEqual(a.Type, b.Type);
    }

Performance is now:

* Elapsed time: 3.52s
* Throughput: 7.96 mil tokens/sec

(Triple the speed, 1/5 the memory, no loss of features on the IToken API.)
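For the Java port discussed at the top of the thread, the same field-by-field equivalence check could be written roughly as below. Token here is a simplified, hypothetical stand-in carrying the same fields the C# test compares, not ANTLR's real API:

```java
import java.util.List;

public class TokenEquivalenceCheck {
    // Hypothetical value holder with the fields the C# test compares.
    static class Token {
        final int type, channel, line, charPositionInLine;
        final int startIndex, stopIndex, tokenIndex;
        final String text;

        Token(int type, int channel, int line, int charPositionInLine,
              int startIndex, int stopIndex, int tokenIndex, String text) {
            this.type = type; this.channel = channel; this.line = line;
            this.charPositionInLine = charPositionInLine;
            this.startIndex = startIndex; this.stopIndex = stopIndex;
            this.tokenIndex = tokenIndex; this.text = text;
        }
    }

    // Mirrors the per-field Assert.AreEqual calls from the C# test above.
    static void assertSameTokens(List<Token> expected, List<Token> actual) {
        if (expected.size() != actual.size())
            throw new AssertionError("token count mismatch");
        for (int i = 0; i < expected.size(); i++) {
            Token a = expected.get(i), b = actual.get(i);
            if (a.channel != b.channel
                    || a.charPositionInLine != b.charPositionInLine
                    || a.line != b.line
                    || a.startIndex != b.startIndex
                    || a.stopIndex != b.stopIndex
                    || !a.text.equals(b.text)
                    || a.tokenIndex != b.tokenIndex
                    || a.type != b.type)
                throw new AssertionError("token " + i + " differs");
        }
    }

    public static void main(String[] args) {
        Token t1 = new Token(4, 0, 1, 0, 0, 2, 0, "abc");
        Token t2 = new Token(4, 0, 1, 0, 0, 2, 0, "abc");
        assertSameTokens(List.of(t1), List.of(t2));
        System.out.println("token streams match");
    }
}
```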
Sam

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Jim Idle
Sent: Wednesday, December 01, 2010 6:10 PM
Cc: [email protected]
Subject: Re: [antlr-dev] Alternative token storage mechanisms

Right, but only if I make my tokens like yours and carry no information to speak of, and also pre-create the collections, etc. Also, send me your tests and I will have a look, as those times do not seem correct to me.

Jim

> -----Original Message-----
> From: Sam Harwell [mailto:[email protected]]
> Sent: Wednesday, December 01, 2010 3:28 PM
> To: 'Jim Idle'
> Cc: [email protected]
> Subject: RE: [antlr-dev] Alternative token storage mechanisms
>
> Same test on the C target release build with full optimizations for
> speed (including ANTLR3_INLINE_INPUT_UTF16):
>
> * Overhead (32-bit): 148 bytes/token
> * Total parse time: 5.88s
> * Rate: 4.76 mil tokens/sec
>
> The token implementation I proposed for C# offers a significant raw
> performance (speed) boost over the C target with optimizations, but
> uses less than 10 bytes/token. :)
>
> I imagine you could pick up a lot by sharing the API portion of your
> tokens (a pointer to a struct with the shared function pointers).
>
> Sam
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> On Behalf Of Jim Idle
> Sent: Wednesday, December 01, 2010 2:57 PM
> Cc: [email protected]
> Subject: Re: [antlr-dev] Alternative token storage mechanisms
>
> It does show how much overhead there is in such languages compared to
> C, though :-)
>
> Jim
>
> > -----Original Message-----
> > From: [email protected] [mailto:[email protected]]
> > On Behalf Of Terence Parr
> > Sent: Wednesday, December 01, 2010 12:50 PM
> > To: Sam Harwell
> > Cc: Johannes Luber ([email protected]); [email protected]
> > Subject: Re: [antlr-dev] Alternative token storage mechanisms
> >
> > Hi Sam. Impressive. Is this all due to no object creation overhead?
> > Ter
> >
> > On Dec 1, 2010, at 8:03 AM, Sam Harwell wrote:
> >
> > > Hi Dr. Parr,
> > >
> > > I revisited my old "slim parsing" work to again measure the
> > > performance difference against Lexer/CommonToken. Currently,
> > > SlimLexer/SlimToken has a limitation that it only stores type,
> > > channel, startIndex, and stopIndex, each limited to 16 bits.
> > > Originally I planned to use this for syntax highlighting, where I
> > > can work within those bounds. Now for the basic metrics, measured
> > > with the following 4-function calculator lexer:
> > >
> > > tokens {
> > >     MUL='*';
> > >     DIV='/';
> > >     MOD='%';
> > >     ADD='+';
> > >     SUB='-';
> > > }
> > >
> > > IDENTIFIER
> > >     : ('a'..'z' | 'A'..'Z' | '_')
> > >       ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
> > >     ;
> > >
> > > NUMBER
> > >     : '0'..'9'+
> > >     ;
> > >
> > > WS
> > >     : (' ' | '\t' | '\n' | '\r' | '\f')+
> > >       {$channel = Hidden;}
> > >     ;
> > >
> > > Memory - CommonToken (32-bit system):
> > >     . 8 bytes overhead for being a class
> > >     . 36 bytes for member variables
> > >
> > > Memory - CommonToken (64-bit system):
> > >     . 16 bytes overhead for being a class (I believe that's the
> > >       object header size)
> > >     . 44 bytes for members
> > >
> > > Memory - SlimToken (32- or 64-bit systems):
> > >     . 8 bytes total storage, and no allocations since it's a
> > >       value type.
> > >
> > > Lexer speed - CommonToken:
> > >     . Total time: 10.34s
> > >     . Rate: 2.71 mil tokens/sec
> > >
> > > Lexer speed - SlimToken:
> > >     . Total time: 2.87s
> > >     . Rate: 9.76 mil tokens/sec
> > >
> > > My goal is to add enough CommonToken features back to SlimToken to
> > > make it usable without breaking its performance characteristics.
> > > To do so, I'm working on a new revision of SlimLexer that holds a
> > > ShortToken (backed by a 32-bit int) or a LongToken (backed by a
> > > 64-bit int) (the lexer is generic in C#).
> > > The token itself stores its type (low 8 bits of a ShortToken,
> > > 16 bits of a LongToken), a flag for whether it's on the default
> > > channel or not (+/-), and 23 or 47 bits for the token index.
> > > As the lexer runs, it builds B-tree indexes for line lengths and
> > > token offsets (with token lengths derived). It also holds a map
> > > from Token->string so that it only has to track text when
> > > necessary. This gives O(1) access to the values that drive
> > > decision making (with (value & 0xFF) giving the token type for a
> > > ShortToken), and O(log_b(n)) access to other values. I expect to
> > > see a great improvement in performance with a very practical
> > > token for real parsing tasks.
> > >
> > > Sam
> >
> > _______________________________________________
> > antlr-dev mailing list
> > [email protected]
> > http://www.antlr.org/mailman/listinfo/antlr-dev
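Taking the field widths in that last message literally (8-bit type in the low bits, the sign bit as the off-default-channel flag, 23 bits of token index in between), the ShortToken packing might be sketched in Java as follows; all helper names are hypothetical, and since the type occupies 8 bits the extraction mask is 0xFF:

```java
// Hedged sketch of a 32-bit packed token: sign bit = channel flag,
// bits 8..30 = token index (23 bits), bits 0..7 = token type.
public class ShortTokenDemo {
    static final int TYPE_MASK = 0xFF;          // low 8 bits hold the type
    static final int INDEX_MASK = 0x7FFFFF;     // 23 bits for the token index
    static final int CHANNEL_BIT = 0x80000000;  // sign bit: set = off default channel

    static int pack(int type, int index, boolean defaultChannel) {
        int v = ((index & INDEX_MASK) << 8) | (type & TYPE_MASK);
        return defaultChannel ? v : (v | CHANNEL_BIT);
    }

    static int type(int token)  { return token & TYPE_MASK; }

    // Unsigned shift so the sign/channel bit does not smear into the index.
    static int index(int token) { return (token >>> 8) & INDEX_MASK; }

    // Matches the +/- flag in the description: non-negative = default channel.
    static boolean onDefaultChannel(int token) { return token >= 0; }

    public static void main(String[] args) {
        int t = pack(42, 1000, false);
        System.out.println(type(t));
        System.out.println(index(t));
        System.out.println(onDefaultChannel(t));
    }
}
```

Line, column, and text would then come from the side structures the message describes (the B-tree offset indexes and the Token->string map) rather than from the packed int itself.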
