Today I decided to try to evaluate the potential performance benefits of a "lightweight" lexer mode. I find that I often don't need or use most of the fields a token carries; the extreme case is a syntax highlighter, which only needs the token type and the start index within the line. For my experiment, I did the following:
* Create the generic interfaces ITokenSource<T> and ITokenStream<T>.
* Create the generic classes Lexer<T> and TokenStream<T> with no virtual functions in the fast path, working on a string instead of one of the ICharStream types.
* Create a struct (in C#, an unboxed value type) holding two shorts, for a total token size of 32 bits.

The test lexer recognizes C-style identifiers, whitespace, and integers. One copy is derived from Lexer, and the other from Lexer<T>. The input for a single iteration is 25,000,000 Unicode chars, generated from 1,000,000 copies of "x-2356*Abte+32+eno/6623+y". I ran 5 iterations of each lexer before starting the timer to let the JIT compile the hot methods, then timed 5 more iterations of each. Here are the summed results:

Elapsed time (normal): 43.546875 seconds.
Elapsed time (fast): 7.078125 seconds.

Summary: For a task I perform very often, deriving from some slightly altered base classes yielded a 6:1 time improvement, substantially lowered memory overhead, and lost none of the information I needed. I'll certainly be examining possibilities for wider use of this work in the future. A rough sketch of the token struct and fast-path lexer is included below.
Sam Harwell
Pixel Mine, Inc.
