Sorry about the misunderstanding there. I've done some extensive work on lexer performance, but it was focused on single source files of up to a couple of dozen megabytes. ANTLR for Java is certainly not equipped to handle large-scale operations even at the scale I was testing, due to some fundamental language limitations. Using carefully written grammars and my experimental "SlimLexer" implementation for the CSharp3 target, I've achieved rates of approximately 10 MB of source per second, which *significantly* outperformed even the C target.
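To illustrate the general idea behind keeping throughput high on big inputs (this is my own hypothetical sketch, not SlimLexer or any actual ANTLR API): a scanner can read from a Reader in fixed-size chunks, so memory stays bounded no matter how large the file is. The class name and the trivial whitespace-delimited token rule below are illustrative only:

```java
// Hypothetical sketch: bounded-memory scanning of arbitrarily large input.
// Reads 8 KB at a time instead of slurping the whole file into memory.
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ChunkedLexDemo {
    // Count whitespace-separated tokens, processing the input chunk by chunk.
    static long countTokens(Reader in) {
        char[] buf = new char[8192];   // fixed-size buffer: memory use is O(1)
        long tokens = 0;
        boolean inToken = false;       // carries token state across chunk boundaries
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                for (int i = 0; i < n; i++) {
                    boolean ws = Character.isWhitespace(buf[i]);
                    if (!ws && !inToken) {
                        tokens++;      // start of a new token
                    }
                    inToken = !ws;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // In practice the Reader would wrap a multi-gigabyte netlist file;
        // a StringReader stands in here for demonstration.
        String src = "module top ( input a , output b ) ;";
        System.out.println(countTokens(new StringReader(src))); // prints 10
    }
}
```

The key point is that the token state (`inToken`) survives across buffer refills, so the chunk size never affects correctness, only I/O granularity.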
The lexer implementation planned for ANTLR v4 should approach (and hopefully exceed) the performance of my SlimLexer, but I don't think there's any intention to handle gigabytes of source code. On a side note, I'm assuming the dozens of gigabytes weren't handwritten but are the result of an intermediate tool in the compiler tool chain. I would treat this as a substantial, unacceptable design flaw in a system designed for business use. Any practical system for data on this scale uses data formats and layouts which can be efficiently manipulated to extract the desired information. This is like replacing a long hallway in an office building with a maze, complaining that it takes too long to get to the bathroom, and wondering if go-karts might help. The problem exists long before parsing is ever considered.

Sam

-----Original Message-----
From: Martin d'Anjou [mailto:[email protected]]
Sent: Tuesday, March 29, 2011 11:55 PM
To: Sam Harwell
Cc: [email protected]
Subject: Re: [antlr-interest] antlr v4 wish list

Hi Sam,

With regards to your answer to item 4) Gigantic files, I meant the problem of lexing and parsing gigantic source files such as Verilog netlists, which can be dozens of gigabytes of source code and take hours to lex and parse due to their size. The problem is reported by http://v2kparse.blogspot.com/2008/06/first-pass-uploaded-to-sourceforce.html . To quote his blog:

"I was compelled to use ANTLR 2.7.7 since the token stream mechanism does not try to slurp in the whole source file, an issue which I encountered with the more recent ANTLR 3.0. While Verilog source files are not generally large, netlist files can be humungous, and one can quickly run out of memory by "slurping in the whole tamale." Anyway, I've communicated the large file slurp issue to the author of ANTLR and he'll be working out a solution in future releases. (If you think large Verilog netlists are problematic to slurp, think about a SPEF file --- where I first encountered the problem using ANTLR 3.x.
Anyway, back to 2.7.7: works fine, even for large SPEF files.)"

As I said, this might have been fixed already, I just don't know.

Regards,
Martin

On 11-03-29 11:29 PM, Sam Harwell wrote:
> 4. With proper integration into the build system, generated files
> aren't checked into source control or distributed. The ANTLR project
> itself generates V2 and V3 grammars, and my .NET projects generate V3
> grammars (using my C# port of the Tool) at build time, so the
> generated files never take up space in source control.
