After looking this over some more, I think I've figured out a way to bring this to the Java runtime. Without user-defined value types, a basic implementation in Java would require tokens to be a [completely type-unsafe] int. However, my implementation focuses on the interaction between FastLexer and FastTokenStream. You could use int[] for the token caches within these classes and, when necessary, expose a FastToken that wraps the int in an IToken implementation. In my implementation, FastToken is just an int (inside a struct), while FastTokenWrapper&lt;FastToken&gt; is the actual IToken implementation: it holds the FastToken plus a reference to the FastLexer needed to resolve all of the IToken information.
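A minimal Java sketch of the idea above — an int[]-backed token cache with an on-demand wrapper. The IToken interface here is a simplified stand-in for ANTLR's (only two members shown), and everything except the FastTokenStream/IToken names mentioned above is hypothetical; a real port would resolve line, text, etc. from the lexer state:

```java
// Sketch: pack token fields into primitive ints, wrap in an object only on demand.
public class FastTokenDemo {
    // Simplified stand-in for ANTLR's IToken; the real interface has more members.
    interface IToken {
        int getType();
        int getChannel();
    }

    // Token cache backed by a primitive int array: two ints per token (type, channel).
    static class FastTokenStream {
        private int[] data = new int[16];
        private int count;

        void add(int type, int channel) {
            if (count * 2 == data.length) {
                int[] grown = new int[data.length * 2];
                System.arraycopy(data, 0, grown, 0, data.length);
                data = grown;
            }
            data[count * 2] = type;
            data[count * 2 + 1] = channel;
            count++;
        }

        // Allocate an IToken wrapper only when a caller actually needs one;
        // the buffer itself stays an array of lightweight primitives.
        IToken get(final int index) {
            return new IToken() {
                public int getType() { return data[index * 2]; }
                public int getChannel() { return data[index * 2 + 1]; }
            };
        }
    }

    public static void main(String[] args) {
        FastTokenStream tokens = new FastTokenStream();
        tokens.add(5, 0);   // e.g. IDENTIFIER on the default channel
        tokens.add(7, 99);  // e.g. WS on a hidden channel
        System.out.println(tokens.get(0).getType());
        System.out.println(tokens.get(1).getChannel());
    }
}
```

The trade-off is exactly the one described above: the wrapper reintroduces an allocation per accessed token, but the cache itself keeps the primitive-array memory footprint.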
You'd lose the elimination of object-creation overhead that accounted for much of the speed gain, but you'd keep the full memory benefit of storing the buffer as an array of lightweight primitives (a 5:1 memory savings in a compiler is non-trivial). I have a lot of polishing work left before I can check this in, but I like the promise it shows. I look forward to incorporating it into the ports of Tool and StringTemplate and into my UnrealScript project for real-world performance testing.

Sam

-----Original Message-----
From: Sam Harwell [mailto:[email protected]]
Sent: Wednesday, December 01, 2010 10:26 PM
To: 'Jim Idle'
Cc: '[email protected]'
Subject: RE: [antlr-dev] Alternative token storage mechanisms

I attached a copy of the grammar and test host for C. It's compiled against the latest C runtime in P4. All test specs and the grammar are identical to the C# version.

I put together an initial implementation of the new storage. I haven't run it through the profiler yet, but I did make sure it passes this set of tests against the behavior of CommonTokenStream/CommonToken:

    // check
    for (int i = 0; i < tokens.Count; i++)
    {
        IToken a = tokens.Get(i);
        IToken b = ((ITokenStream)fastTokens).Get(i);
        Assert.AreEqual(a.Channel, b.Channel);
        Assert.AreEqual(a.CharPositionInLine, b.CharPositionInLine);
        Assert.AreEqual(a.Line, b.Line);
        Assert.AreEqual(a.StartIndex, b.StartIndex);
        Assert.AreEqual(a.StopIndex, b.StopIndex);
        Assert.AreEqual(a.Text, b.Text);
        Assert.AreEqual(a.TokenIndex, b.TokenIndex);
        Assert.AreEqual(a.Type, b.Type);
    }

Performance is now:

* Elapsed time: 3.52s
* Throughput: 7.96 mil tokens/sec

(Triple the speed, 1/5 the memory, no loss of features on the IToken API.)
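For the Java port discussed at the top of the thread, the same field-by-field equivalence check could be written roughly as below. Token here is a simplified, hypothetical stand-in carrying the same fields the C# test compares, not ANTLR's real API:

```java
import java.util.List;

public class TokenEquivalenceCheck {
    // Hypothetical value holder with the fields the C# test compares.
    static class Token {
        final int type, channel, line, charPositionInLine;
        final int startIndex, stopIndex, tokenIndex;
        final String text;

        Token(int type, int channel, int line, int charPositionInLine,
              int startIndex, int stopIndex, int tokenIndex, String text) {
            this.type = type; this.channel = channel; this.line = line;
            this.charPositionInLine = charPositionInLine;
            this.startIndex = startIndex; this.stopIndex = stopIndex;
            this.tokenIndex = tokenIndex; this.text = text;
        }
    }

    // Mirrors the per-field Assert.AreEqual calls from the C# test above.
    static void assertSameTokens(List<Token> expected, List<Token> actual) {
        if (expected.size() != actual.size())
            throw new AssertionError("token count mismatch");
        for (int i = 0; i < expected.size(); i++) {
            Token a = expected.get(i), b = actual.get(i);
            if (a.channel != b.channel
                    || a.charPositionInLine != b.charPositionInLine
                    || a.line != b.line
                    || a.startIndex != b.startIndex
                    || a.stopIndex != b.stopIndex
                    || !a.text.equals(b.text)
                    || a.tokenIndex != b.tokenIndex
                    || a.type != b.type)
                throw new AssertionError("token " + i + " differs");
        }
    }

    public static void main(String[] args) {
        Token t1 = new Token(4, 0, 1, 0, 0, 2, 0, "abc");
        Token t2 = new Token(4, 0, 1, 0, 0, 2, 0, "abc");
        assertSameTokens(List.of(t1), List.of(t2));
        System.out.println("token streams match");
    }
}
```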
Sam

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Jim Idle
Sent: Wednesday, December 01, 2010 6:10 PM
Cc: [email protected]
Subject: Re: [antlr-dev] Alternative token storage mechanisms

Right, but only if I make my tokens like yours and carry no information to speak of, and also pre-create the collections, etc. Also, send me your tests and I will have a look, as those times do not seem correct to me.

Jim

> -----Original Message-----
> From: Sam Harwell [mailto:[email protected]]
> Sent: Wednesday, December 01, 2010 3:28 PM
> To: 'Jim Idle'
> Cc: [email protected]
> Subject: RE: [antlr-dev] Alternative token storage mechanisms
>
> Same test on the C target release build with full optimizations for
> speed (including ANTLR3_INLINE_INPUT_UTF16):
>
> * Overhead (32-bit): 148 bytes/token
> * Total parse time: 5.88s
> * Rate: 4.76 mil tokens/sec
>
> The token implementation I proposed for C# offers a significant raw
> performance (speed) boost over the C target with optimizations, but
> uses less than 10 bytes/token. :)
>
> I imagine you could pick up a lot by sharing the API portion of your
> tokens (a pointer to a struct with the shared function pointers).
>
> Sam
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> On Behalf Of Jim Idle
> Sent: Wednesday, December 01, 2010 2:57 PM
> Cc: [email protected]
> Subject: Re: [antlr-dev] Alternative token storage mechanisms
>
> It does show how much overhead there is in such languages compared to
> C, though :-)
>
> Jim
>
> > -----Original Message-----
> > From: [email protected] [mailto:[email protected]]
> > On Behalf Of Terence Parr
> > Sent: Wednesday, December 01, 2010 12:50 PM
> > To: Sam Harwell
> > Cc: Johannes Luber ([email protected]); [email protected]
> > Subject: Re: [antlr-dev] Alternative token storage mechanisms
> >
> > Hi Sam. Impressive. Is this all due to no object creation overhead?
> > Ter
> >
> > On Dec 1, 2010, at 8:03 AM, Sam Harwell wrote:
> >
> > > Hi Dr. Parr,
> > >
> > > I revisited my old "slim parsing" work to again measure the
> > > performance difference against Lexer/CommonToken. Currently,
> > > SlimLexer/SlimToken has a limitation that it only stores type,
> > > channel, startIndex, and stopIndex, each limited to 16 bits.
> > > Originally I planned to use this for syntax highlighting, where I
> > > can work within those bounds. Now for the basic metrics, measured
> > > with the following 4-function calculator lexer:
> > >
> > > tokens {
> > >     MUL='*';
> > >     DIV='/';
> > >     MOD='%';
> > >     ADD='+';
> > >     SUB='-';
> > > }
> > >
> > > IDENTIFIER
> > >     : ('a'..'z' | 'A'..'Z' | '_')
> > >       ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
> > >     ;
> > >
> > > NUMBER
> > >     : '0'..'9'+
> > >     ;
> > >
> > > WS
> > >     : (' ' | '\t' | '\n' | '\r' | '\f')+
> > >       {$channel = Hidden;}
> > >     ;
> > >
> > > Memory - CommonToken (32-bit system):
> > >     . 8 bytes overhead for being a class
> > >     . 36 bytes for member variables
> > >
> > > Memory - CommonToken (64-bit system):
> > >     . 16 bytes overhead for being a class (I believe that's the
> > >       object header size)
> > >     . 44 bytes for members
> > >
> > > Memory - SlimToken (32- or 64-bit systems):
> > >     . 8 bytes total storage, and no allocations since it's a
> > >       value type.
> > >
> > > Lexer speed - CommonToken:
> > >     . Total time: 10.34s
> > >     . Rate: 2.71 mil tokens/sec
> > >
> > > Lexer speed - SlimToken:
> > >     . Total time: 2.87s
> > >     . Rate: 9.76 mil tokens/sec
> > >
> > > My goal is to add enough CommonToken features back to SlimToken to
> > > make it usable without breaking its performance characteristics.
> > > To do so, I'm working on a new revision of SlimLexer that holds a
> > > ShortToken (backed by a 32-bit int) or a LongToken (backed by a
> > > 64-bit int) (the lexer is generic in C#).
> > > The token itself stores its type (low 8 bits of a ShortToken,
> > > 16 bits of a LongToken), a flag for whether it's on the default
> > > channel or not (+/-), and 23 or 47 bits for the token index.
> > > As the lexer runs, it builds B-tree indexes for line lengths and
> > > token offsets (with token lengths derived). It also holds a map
> > > from Token->string so that it only has to track text when
> > > necessary. This gives O(1) access to the values that drive
> > > decision making (with (value & 0xFF) giving the token type for a
> > > ShortToken), and O(log_b(n)) access to other values. I expect to
> > > see a great improvement in performance with a very practical
> > > token for real parsing tasks.
> > >
> > > Sam
> >
> > _______________________________________________
> > antlr-dev mailing list
> > [email protected]
> > http://www.antlr.org/mailman/listinfo/antlr-dev
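Taking the field widths in that last message literally (8-bit type in the low bits, the sign bit as the off-default-channel flag, 23 bits of token index in between), the ShortToken packing might be sketched in Java as follows; all helper names are hypothetical, and since the type occupies 8 bits the extraction mask is 0xFF:

```java
// Hedged sketch of a 32-bit packed token: sign bit = channel flag,
// bits 8..30 = token index (23 bits), bits 0..7 = token type.
public class ShortTokenDemo {
    static final int TYPE_MASK = 0xFF;          // low 8 bits hold the type
    static final int INDEX_MASK = 0x7FFFFF;     // 23 bits for the token index
    static final int CHANNEL_BIT = 0x80000000;  // sign bit: set = off default channel

    static int pack(int type, int index, boolean defaultChannel) {
        int v = ((index & INDEX_MASK) << 8) | (type & TYPE_MASK);
        return defaultChannel ? v : (v | CHANNEL_BIT);
    }

    static int type(int token)  { return token & TYPE_MASK; }

    // Unsigned shift so the sign/channel bit does not smear into the index.
    static int index(int token) { return (token >>> 8) & INDEX_MASK; }

    // Matches the +/- flag in the description: non-negative = default channel.
    static boolean onDefaultChannel(int token) { return token >= 0; }

    public static void main(String[] args) {
        int t = pack(42, 1000, false);
        System.out.println(type(t));
        System.out.println(index(t));
        System.out.println(onDefaultChannel(t));
    }
}
```

Line, column, and text would then come from the side structures the message describes (the B-tree offset indexes and the Token->string map) rather than from the packed int itself.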
