Hi Jim,

 

Unfortunately, I won't have a chance to do instrumented profiling
until a couple of weeks from now. I do have a number of tricks I hope to
implement over time to improve the speed of my grammars, with similar
results across the board. :)

 

My machine is 32-bit XP, 2x Athlon FX-70, though the test only uses 1 of
the 4 cores. Here is the very simple lexer I'm using:

 

//
// LEXER
//

IDENTIFIER
        :       ('a'..'z' | 'A'..'Z' | '_')
                ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
        ;

NUMBER
        :       '0'..'9'+
        ;

WS
        :       (' ' | '\t' | '\n' | '\r' | '\f')
                {$channel = Hidden;}
        ;
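For reference, a hand-written equivalent of those three rules might look roughly like this. This is a sketch in Java rather than the C# target, with illustrative names (TinyLexer, tokenStart) that are not from the generated code:

```java
// Hypothetical hand-written equivalent of the IDENTIFIER, NUMBER, and WS
// rules above -- a sketch, not ANTLR output.
public class TinyLexer {
    public static final int EOF = -1, IDENTIFIER = 1, NUMBER = 2, WS = 3;
    private final String input;
    private int p;              // current char index
    public int tokenStart;      // start index of the most recent token

    public TinyLexer(String input) { this.input = input; }

    public int nextToken() {
        if (p >= input.length()) return EOF;
        tokenStart = p;
        char c = input.charAt(p);
        if (c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f') {
            p++;                         // WS matches a single char, like the rule above
            return WS;
        }
        if (isDigit(c)) {                // NUMBER : '0'..'9'+
            do { p++; } while (p < input.length() && isDigit(input.charAt(p)));
            return NUMBER;
        }
        if (isIdStart(c)) {              // IDENTIFIER
            do { p++; } while (p < input.length() && isIdPart(input.charAt(p)));
            return IDENTIFIER;
        }
        throw new IllegalArgumentException("no viable token at index " + p);
    }

    private static boolean isDigit(char c) { return c >= '0' && c <= '9'; }
    private static boolean isIdStart(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
    }
    private static boolean isIdPart(char c) { return isIdStart(c) || isDigit(c); }
}
```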

 

 

From: Jim Idle [mailto:[email protected]] 
Sent: Sunday, May 03, 2009 5:14 PM
To: Sam Harwell
Cc: ANTLR-dev Dev
Subject: Re: [antlr-dev] Interesting Lexer performance results

 

Sam Harwell wrote: 

Today I decided to try to evaluate the potential performance benefits
of a "lightweight" lexer mode. I find that I often don't need/use many
of the items in the token, the limiting case being syntax highlighters
that only need the token type and the start index in the line. For my
experiment, I did the following:

 

- Create the generic interfaces ITokenSource<T> and ITokenStream<T>.

- Create the generic classes Lexer<T> and TokenStream<T> with no virtual
functions in the fast path, including working on a string instead of one
of the ICharStream types.

- Create a struct (in C#, this is an unboxed value type) with 2 shorts,
for a total token size of 32 bits.
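In Java terms, which lack unboxed structs, the 32-bit token idea could be sketched by packing the two 16-bit fields into a single int. The field layout below is my assumption; the C# struct simply holds two shorts:

```java
// Sketch of a 32-bit "token" packed into an int: token type in the high
// 16 bits, start index within the line in the low 16 bits. Layout is
// illustrative, not Sam's actual struct.
public final class PackedToken {
    private PackedToken() {}

    public static int pack(int type, int startInLine) {
        return (type << 16) | (startInLine & 0xFFFF);
    }
    public static int type(int token)  { return token >>> 16; }
    public static int start(int token) { return token & 0xFFFF; }
}
```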

 

The test lexer recognizes C-style identifiers, whitespace, and integers.
One copy is derived from Lexer, and the other from Lexer<T>.

 

The input for a single iteration is 25,000,000 Unicode chars, generated
from 1,000,000 copies of "x-2356*Abte+32+eno/6623+y". I ran 5 iterations
of each lexer before starting the timer to allow the JIT to compile the
hot methods. I then timed 5 iterations of each; here is the summed
result:
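The measurement method described above (untimed warm-up passes so the JIT compiles the hot methods, then timed passes) can be sketched like this; the names Bench and lexAll are illustrative, not from the actual harness:

```java
// Warm-up-then-measure timing sketch: the first loop lets the JIT
// compile the hot methods before any timing begins.
public class Bench {
    public static long timeLexer(Runnable lexAll, int warmup, int measured) {
        for (int i = 0; i < warmup; i++) lexAll.run();    // warm-up, untimed
        long t0 = System.nanoTime();
        for (int i = 0; i < measured; i++) lexAll.run();  // timed iterations
        return System.nanoTime() - t0;                    // elapsed nanoseconds
    }
}
```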

 

Elapsed time (normal): 43.546875 seconds.

Elapsed time (fast): 7.078125 seconds.

 

Summary: For a particular task I perform very often, deriving from some
slightly altered base classes yielded a 6:1 time improvement,
substantially lowered memory overhead, and did not lose any information
I needed. I'll certainly be examining possibilities for wider use of
this work in the future.

 

Hi Sam,

Send along your lexer; I would like to see how this compares with C (I
presume your measurements are for C#?). Also, what does profiling tell
you about the difference in time? Object creation? Of course it is a
fairly simple lexer, but in this case I think that is valid, because the
time differences are then isolated to the things that come with more
complicated tokens.

I was going to do a simple C version as a target, but having used it in
anger, I find C is already fast enough. We do need to do some
performance improvement work, but I suspect that will really happen when
Ter is freed from having to work for a living for a while, coming up
soon ;-)

Jim

_______________________________________________
antlr-dev mailing list
[email protected]
http://www.antlr.org/mailman/listinfo/antlr-dev
