Today I decided to try to evaluate the potential performance benefits
of a "lightweight" lexer mode. I find that I often don't need or use
many of the fields in a token; the extreme case is a syntax
highlighter, which only needs the token type and the start index
within the line. For my experiment, I did the following:

 

* Create the generic interfaces ITokenSource<T> and ITokenStream<T>.

* Create the generic classes Lexer<T> and TokenStream<T> with no
virtual functions in the fast path, working directly on a string
instead of one of the ICharStream types.

* Create a struct (in C#, an unboxed value type) with 2 shorts, for a
total token size of 32 bits. A rough sketch of these types follows
this list.
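
For concreteness, here is roughly the shape I have in mind. The names
(SlimToken) and member lists are illustrative approximations, not the
exact code I used:

    // Illustrative sketch only: a 32-bit token made of two shorts, and the
    // generic interfaces it flows through.  Member names are approximate.
    public struct SlimToken
    {
        public short Type;        // token type
        public short StartIndex;  // start index within the line

        public SlimToken(short type, short startIndex)
        {
            Type = type;
            StartIndex = startIndex;
        }
    }

    public interface ITokenSource<T> where T : struct
    {
        T NextToken();
    }

    public interface ITokenStream<T> where T : struct
    {
        T LT(int k);     // look ahead k tokens
        void Consume();  // advance past the current token
    }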

 

The test lexer recognizes C-style identifiers, whitespace, and integers.
One copy is derived from Lexer, and the other from Lexer<T>.
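
To give an idea of what the fast path looks like, here is a
hand-written approximation of the NextToken loop over the SlimToken
sketch above. This is a sketch only: the token type constants and the
handling of non-token characters are made up for illustration, and for
simplicity it implements ITokenSource<SlimToken> directly instead of
deriving from Lexer<SlimToken>:

    // Sketch of a string-based fast-path lexer producing SlimToken values.
    public class SlimTestLexer : ITokenSource<SlimToken>
    {
        public const short EOF = 0, ID = 1, INT = 2, WS = 3, OTHER = 4;

        private readonly string _input;
        private int _index;

        public SlimTestLexer(string input)
        {
            _input = input;
        }

        public SlimToken NextToken()
        {
            // For brevity this sketch stores the raw index; the real token
            // stores the start index within the current line, so a short
            // is enough even for very large inputs.
            short start = (short)_index;

            if (_index >= _input.Length)
                return new SlimToken(EOF, start);

            char c = _input[_index];

            if (c == '_' || char.IsLetter(c))
            {
                while (_index < _input.Length
                       && (_input[_index] == '_' || char.IsLetterOrDigit(_input[_index])))
                    _index++;
                return new SlimToken(ID, start);
            }

            if (char.IsDigit(c))
            {
                while (_index < _input.Length && char.IsDigit(_input[_index]))
                    _index++;
                return new SlimToken(INT, start);
            }

            if (char.IsWhiteSpace(c))
            {
                while (_index < _input.Length && char.IsWhiteSpace(_input[_index]))
                    _index++;
                return new SlimToken(WS, start);
            }

            // any other character (the operators in the test input)
            _index++;
            return new SlimToken(OTHER, start);
        }
    }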

 

The input for a single iteration is 25,000,000 Unicode characters,
generated from 1,000,000 copies of "x-2356*Abte+32+eno/6623+y". I ran
5 iterations of each lexer before starting the timer so the JIT could
compile the hot methods, then timed 5 more iterations of each. Here
are the total times for the 5 timed iterations:

 

Elapsed time (normal): 43.546875 seconds.

Elapsed time (fast): 7.078125 seconds.
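
For reference, the timing harness was along these lines. This is a
sketch that reuses the SlimTestLexer sketch above for the "fast" case;
the warm-up count, iteration count, and input match the description,
but the driver names (LexerBenchmark, RunOnce) and the use of Stopwatch
are assumptions:

    // Sketch of the benchmark driver: build the 25,000,000-character input,
    // run 5 untimed warm-up iterations so the JIT compiles the hot methods,
    // then time 5 more iterations.
    using System;
    using System.Diagnostics;
    using System.Text;

    static class LexerBenchmark
    {
        static void Main()
        {
            var builder = new StringBuilder(25000000);
            for (int i = 0; i < 1000000; i++)
                builder.Append("x-2356*Abte+32+eno/6623+y");
            string input = builder.ToString();

            for (int i = 0; i < 5; i++)       // warm-up, untimed
                RunOnce(input);

            var stopwatch = Stopwatch.StartNew();
            for (int i = 0; i < 5; i++)       // timed iterations
                RunOnce(input);
            stopwatch.Stop();

            Console.WriteLine("Elapsed time (fast): {0} seconds.",
                              stopwatch.Elapsed.TotalSeconds);
        }

        static void RunOnce(string input)
        {
            var lexer = new SlimTestLexer(input);
            while (lexer.NextToken().Type != SlimTestLexer.EOF)
            {
                // just drain the tokens; nothing is done with them
            }
        }
    }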

 

Summary: For a particular task I perform very often, deriving from
slightly altered base classes yielded roughly a 6:1 improvement in
running time and substantially lower memory overhead, without losing
any information I needed. I'll certainly be examining possibilities
for wider use of this work in the future.

 

Sam Harwell

Pixel Mine, Inc.
