Your points are well taken, but there are some misunderstandings here. I have tried to say on a number of occasions that the string factory thing coming from $xyz.text is just a convenience thing that should not be used in any environment that requires performance/memory efficiency. It is just a reasonable way to access the pointers and stuff.
However, look guys, this is C!! By which I mean, for real efficiency, you should be accessing things such as the text of the token via the pointers in the token and not via the artifice of $text. In the next release I will document this better and I apologize for not having done so up to press. There are also lots of macros and switches you can set that will improve performance a lot, and the upcoming release has lots of performance improvements. For comparison, I am currently working on a parser for IBM that is 7X faster than the 2.7.x C++ equivalent. Once again, I apologize for not documenting all of this stuff as well as it might be, but the code itself is well documented; there just needs to be more usage docs I think. Thanks for the feedback, jim -----Original Message----- From: [email protected] [mailto:[email protected]] On Behalf Of Christopher L Conway Sent: Tuesday, March 02, 2010 6:44 PM To: [email protected] Subject: Re: [antlr-interest] Bounding the token stream in the C backend On Thu, Feb 25, 2010 at 12:09 AM, Christopher L Conway <[email protected]> wrote: > I've got a large input file (~39MB) that I'm attempting to parse with > an ANTLR3-generated C parser. The parser is using a huge amount of > memory (~3.7GB) and seems to start thrashing without making much > progress towards termination. I found a thread from earlier this month > (http://markmail.org/message/jfngdd2ci6h7qrbo) suggesting the most > likely cause of such behavior is a parser bug, but I've stepped > through the code and it seems to be lexing just fine. Rather, it seems > the problem is that fillBuffer() is tokenizing the whole file in one > go; then, the parsing rules slow to a crawl because the token buffer > is sitting on all the memory. > > I wonder if there is a way to change fillBuffer()'s behavior, so that > it will only lex some bounded number of tokens before allowing parsing > to proceed? I have a partial solution to this problem. To be clear, the issue is: 1. The default token stream implementation tokenizes the entire input on the first call to LT(). 2. The default token factory never de-allocates tokens. Since a token structure is more than 100 bytes, large inputs can easily consume multiple GB before parsing even begins. (This is in the C back-end. I don't know about other back-ends.) The solution consists of: 1. A token stream implementation that tokenizes up to a fixed lookahead limit. 2. A token factory that pre-allocates a fixed number of tokens, recycling old tokens when new ones are requested. This seems to be a sound strategy, so long as the input grammar has an known lookahead limit and the allocation pool is sufficiently large. My grammar is LL(2), and a lookahead limit of 2 with a token pool of 8 tokens works just fine. Using this implementation, I'm able to parse the above-mentioned 39MB input file using less than 700MB memory, a more than 5-fold improvement on the defaults (as an added benefit, the parser actually terminated after 45s and didn't completely lock my system!). The parser is about 40% faster than an equivalent ANTLR 2.7 parser using the C++ back-end, but still uses about 5 times as much memory. The remaining excess memory usage seems to be due to the default string factory implementation, which also doesn't seem to ever release memory once allocated. This is a much more complex beast and I'm hesitant to tackle it. If anybody is interested in using this code, I'm willing to clean it up and share it with the community. Please feel free to contact me. Regards, Chris List: http://www.antlr.org/mailman/listinfo/antlr-interest Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address List: http://www.antlr.org/mailman/listinfo/antlr-interest Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address -- You received this message because you are subscribed to the Google Groups "il-antlr-interest" group. To post to this group, send email to [email protected]. To unsubscribe from this group, send email to [email protected]. For more options, visit this group at http://groups.google.com/group/il-antlr-interest?hl=en.
