On Thu, Feb 25, 2010 at 12:09 AM, Christopher L Conway <[email protected]> wrote:
> I've got a large input file (~39MB) that I'm attempting to parse with
> an ANTLR3-generated C parser. The parser is using a huge amount of
> memory (~3.7GB) and seems to start thrashing without making much
> progress towards termination. I found a thread from earlier this month
> (http://markmail.org/message/jfngdd2ci6h7qrbo) suggesting the most
> likely cause of such behavior is a parser bug, but I've stepped
> through the code and it seems to be lexing just fine. Rather, it seems
> the problem is that fillBuffer() is tokenizing the whole file in one
> go; then, the parsing rules slow to a crawl because the token buffer
> is sitting on all the memory.
>
> I wonder if there is a way to change fillBuffer()'s behavior, so that
> it will only lex some bounded number of tokens before allowing parsing
> to proceed?
I have a partial solution to this problem. To be clear, the issue is:

1. The default token stream implementation tokenizes the entire input on the first call to LT().
2. The default token factory never de-allocates tokens.

Since a token structure is more than 100 bytes, large inputs can easily consume multiple GB before parsing even begins. (This is in the C back-end. I don't know about other back-ends.)

The solution consists of:

1. A token stream implementation that tokenizes up to a fixed lookahead limit.
2. A token factory that pre-allocates a fixed number of tokens, recycling old tokens when new ones are requested.

This seems to be a sound strategy, so long as the input grammar has a known lookahead limit and the allocation pool is sufficiently large. My grammar is LL(2), and a lookahead limit of 2 with a token pool of 8 tokens works just fine.

Using this implementation, I'm able to parse the above-mentioned 39MB input file using less than 700MB of memory, a more than 5-fold improvement on the defaults (as an added benefit, the parser actually terminated after 45s and didn't completely lock up my system!). The parser is about 40% faster than an equivalent ANTLR 2.7 parser using the C++ back-end, but still uses about 5 times as much memory. The remaining excess memory usage seems to be due to the default string factory implementation, which also doesn't seem to ever release memory once allocated. This is a much more complex beast and I'm hesitant to tackle it.

If anybody is interested in using this code, I'm willing to clean it up and share it with the community. Please feel free to contact me.

Regards,
Chris
