[il-antlr-interest: 28148] Re: [antlr-interest] Bounding the token stream in the C backend

Jim Idle Tue, 02 Mar 2010 23:37:36 -0800

Your points are well taken, but there are some misunderstandings here. I have 
tried to say on a number of occasions that the string factory behind 
$xyz.text is just a convenience that should not be used in any 
environment that requires performance or memory efficiency. It is just a 
reasonable way to get at the underlying pointers.

However, look guys, this is C!! By which I mean, for real efficiency, you 
should be accessing things such as the text of the token via the pointers in 
the token and not via the artifice of $text. In the next release I will 
document this better, and I apologize for not having done so up to now. There 
are also lots of macros and switches you can set that will improve performance 
a lot, and the upcoming release has lots of performance improvements. For 
comparison, I am currently working on a parser for IBM that is 7X faster than 
the 2.7.x C++ equivalent. Once again, I apologize for not documenting all of 
this stuff as well as it might be, but the code itself is well documented; 
there just need to be more usage docs, I think.
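
As a minimal sketch of what "via the pointers in the token" can look like: 
the struct and function names below are hypothetical stand-ins, not the 
ANTLR3 C runtime's own types. The only assumption carried over is that a 
token records where its text starts and stops inside the input buffer, so 
the text can be compared or copied without going through a string factory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a token: it points into the input buffer
 * instead of owning a copy of its text. Assumes non-empty tokens. */
typedef struct {
    int         type;
    const char *start;  /* first character of the token text */
    const char *stop;   /* last character of the token text (inclusive) */
} sketch_token;

/* Length of the token text in characters. */
size_t sketch_token_length(const sketch_token *tok)
{
    assert(tok != NULL && tok->start != NULL && tok->stop >= tok->start);
    return (size_t)(tok->stop - tok->start) + 1;
}

/* Compare the token text with a NUL-terminated keyword; no allocation. */
int sketch_token_equals(const sketch_token *tok, const char *keyword)
{
    size_t len = sketch_token_length(tok);
    return strlen(keyword) == len && memcmp(tok->start, keyword, len) == 0;
}

/* Copy the token text into a caller-supplied buffer, truncating if needed.
 * Always NUL-terminates when cap > 0; returns the number of characters copied. */
size_t sketch_token_copy(const sketch_token *tok, char *dst, size_t cap)
{
    size_t len;
    if (cap == 0) {
        return 0;
    }
    len = sketch_token_length(tok);
    if (len > cap - 1) {
        len = cap - 1;
    }
    memcpy(dst, tok->start, len);
    dst[len] = '\0';
    return len;
}
```

In an action this would be used on the token itself rather than on $text, 
so nothing is allocated unless the caller asks for a copy.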

Thanks for the feedback,

jim

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Christopher L Conway
Sent: Tuesday, March 02, 2010 6:44 PM
To: [email protected]
Subject: Re: [antlr-interest] Bounding the token stream in the C backend

On Thu, Feb 25, 2010 at 12:09 AM, Christopher L Conway
<[email protected]> wrote:
> I've got a large input file (~39MB) that I'm attempting to parse with
> an ANTLR3-generated C parser. The parser is using a huge amount of
> memory (~3.7GB) and seems to start thrashing without making much
> progress towards termination. I found a thread from earlier this month
> (http://markmail.org/message/jfngdd2ci6h7qrbo) suggesting the most
> likely cause of such behavior is a parser bug, but I've stepped
> through the code and it seems to be lexing just fine. Rather, it seems
> the problem is that fillBuffer() is tokenizing the whole file in one
> go; then, the parsing rules slow to a crawl because the token buffer
> is sitting on all the memory.
>
> I wonder if there is a way to change fillBuffer()'s behavior, so that
> it will only lex some bounded number of tokens before allowing parsing
> to proceed?

I have a partial solution to this problem. To be clear, the issue is:

1. The default token stream implementation tokenizes the entire input
on the first call to LT().
2. The default token factory never de-allocates tokens.

Since a token structure is more than 100 bytes, large inputs can
easily consume multiple GB before parsing even begins. (This is in the
C back-end. I don't know about other back-ends.)

The solution consists of:

1. A token stream implementation that tokenizes up to a fixed lookahead limit.
2. A token factory that pre-allocates a fixed number of tokens,
recycling old tokens when new ones are requested.
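
To illustrate the first item, here is a minimal sketch of a token stream 
that lexes only as far as the requested lookahead. It is not the code 
described in this message and does not use the ANTLR3 C runtime's types; 
bounded_stream, la_token, next_token_fn and counting_lexer are all 
hypothetical names, and MAX_LOOKAHEAD is an assumed grammar limit.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOOKAHEAD 2  /* assumed lookahead limit of the grammar */

typedef struct { int type; } la_token;  /* hypothetical minimal token */

/* Hypothetical lexer callback: writes the next token into *out. */
typedef void (*next_token_fn)(void *lexer_state, la_token *out);

typedef struct {
    la_token      buf[MAX_LOOKAHEAD]; /* ring of lexed, unconsumed tokens */
    size_t        head;               /* slot holding LT(1) */
    size_t        count;              /* number of buffered tokens */
    next_token_fn next;
    void         *lexer_state;
} bounded_stream;

void bounded_stream_init(bounded_stream *s, next_token_fn next, void *state)
{
    s->head = 0;
    s->count = 0;
    s->next = next;
    s->lexer_state = state;
}

/* Return the k-th lookahead token (1-based), lexing only as far as needed. */
const la_token *bounded_stream_lt(bounded_stream *s, size_t k)
{
    assert(k >= 1 && k <= MAX_LOOKAHEAD);
    while (s->count < k) {
        size_t slot = (s->head + s->count) % MAX_LOOKAHEAD;
        s->next(s->lexer_state, &s->buf[slot]);
        s->count++;
    }
    return &s->buf[(s->head + k - 1) % MAX_LOOKAHEAD];
}

/* Discard LT(1); its slot becomes free for the next lexed token. */
void bounded_stream_consume(bounded_stream *s)
{
    if (s->count == 0) {
        (void)bounded_stream_lt(s, 1);
    }
    s->head = (s->head + 1) % MAX_LOOKAHEAD;
    s->count--;
}

/* Demo lexer: emits token types 1, 2, 3, ... and counts what it produced. */
void counting_lexer(void *state, la_token *out)
{
    int *produced = (int *)state;
    out->type = ++*produced;
}
```

With the demo lexer, asking for LT(1) lexes one token and asking for LT(2) 
lexes one more; nothing beyond the requested lookahead is tokenized.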

This seems to be a sound strategy, so long as the input grammar has a
known lookahead limit and the allocation pool is sufficiently large.
My grammar is LL(2), and a lookahead limit of 2 with a token pool of 8
tokens works just fine.
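
The recycling factory can be sketched along the same lines. This is an 
illustration under stated assumptions, not the implementation described 
here: token_pool and pool_token are hypothetical names, the pool size of 8 
simply mirrors the figure above, and the serial field exists only to make 
the reuse visible.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 8  /* number of pre-allocated tokens, as in the text above */

/* Hypothetical token; serial records how many tokens were issued before it. */
typedef struct {
    int           type;
    const char   *start;
    const char   *stop;
    unsigned long serial;
} pool_token;

typedef struct {
    pool_token    slots[POOL_SIZE]; /* allocated once, never grown */
    size_t        next;             /* slot to hand out next */
    unsigned long issued;           /* total tokens handed out so far */
} token_pool;

void token_pool_init(token_pool *p)
{
    p->next = 0;
    p->issued = 0;
}

/* Hand out the oldest slot again. A token stays valid only until
 * POOL_SIZE further tokens have been requested. */
pool_token *token_pool_new(token_pool *p)
{
    pool_token *t = &p->slots[p->next];
    p->next = (p->next + 1) % POOL_SIZE;
    t->type = 0;
    t->start = NULL;
    t->stop = NULL;
    t->serial = p->issued++;
    return t;
}
```

The design constraint is that no rule may hold on to a token for longer 
than POOL_SIZE subsequent allocations, which is why the pool has to be 
larger than the lookahead limit.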

Using this implementation, I'm able to parse the above-mentioned 39MB
input file using less than 700MB of memory, a more than 5-fold
improvement on the defaults (as an added benefit, the parser actually
terminated after 45s and didn't completely lock my system!). The
parser is about 40% faster than an equivalent ANTLR 2.7 parser using
the C++ back-end, but still uses about 5 times as much memory. The
remaining excess memory usage seems to be due to the default string
factory implementation, which also doesn't seem to ever release memory
once allocated. This is a much more complex beast and I'm hesitant to
tackle it.

If anybody is interested in using this code, I'm willing to clean it
up and share it with the community. Please feel free to contact me.

Regards,
Chris

List: http://www.antlr.org/mailman/listinfo/antlr-interest
Unsubscribe: 
http://www.antlr.org/mailman/options/antlr-interest/your-email-address