Hey all, OK, let me try a related but far less involved question:
ANTLR tokenizes all input into an internal list before parsing anything in that list. (Right?) Hence, it runs out of memory trying to read my 6.2-million-line input file, because that list is huge. What's the ANTLR way to handle such large input streams?

Thanks,
-Chris

On Thu, Apr 29, 2010 at 4:33 PM, Chris verBurg <[email protected]> wrote:

> Hey guys,
>
> A question was posted a few days ago about dealing with an infinite input
> stream, and the suggestion was to subclass TokenStream so that it didn't
> read in all of the input upfront.
>
> I'm running into a similar problem, but before I go run off and subclass
> things I thought I'd see if there's a "best practice" for my situation. It
> also overlaps with the "how do I use keywords as identifiers"
> <http://www.antlr.org/wiki/pages/viewpage.action?pageId=1741> FAQ.
>
> I have a data-file grammar that recognizes strings, numbers, and a ton of
> keywords. Pretending "VERSION" and "LIMIT" are two keywords, here's (part
> of) the .g file:
>
> data_file:
>     'VERSION' STRING ';'
>   | 'LIMIT' NUMBER ';'
>   ;
>
> NUMBER:
>     ('-'|'+')? ('0'..'9')+
>   | ('-'|'+')? ('0'..'9')* '.' ('0'..'9')*
>   ;
>
> STRING:
>   ('a'..'z' | 'A'..'Z' | '_' | '.' | '0'..'9')+ ;
>
>
> Problem input #1:
>
>   VERSION 1.2 ;
>
> The "1.2" is lexed as a number instead of a string, so I get a parse error.
>
> Problem input #2:
>
>   VERSION LIMIT ;
>
> The "LIMIT" is lexed as a keyword instead of a string, so I get a parse
> error.
>
>
> I saw the FAQ about keywords-as-identifiers, but I don't think it's helpful
> for me. For the NUMBER-that-should-be-a-STRING problem, there's no exact
> string I could pass to input.LT(1).getText().equals(), because it requires
> a regex to match a NUMBER. The other solution was to make an "identifier"
> rule to match all possibilities -- is the best solution here really to
> change the rule to 'VERSION' (STRING | NUMBER) ';'?
>
> For the keyword-that-should-be-a-STRING problem, I'm hesitant to use either
> of those solutions because of the sheer number of keywords in this grammar.
>
>
> Ideally what I'd like to do is what I did in Flex and Bison (which I'm
> porting this grammar from). What I did there was have the parser control
> how the lexer interpreted subsequent tokens. I embedded a rule in the
> parser, immediately after the 'VERSION' token, to tell Flex to enter a
> "force-the-next-token-to-be-a-STRING-no-matter-what" start state. It worked
> beautifully. I got most of the way through implementing that in my ANTLR
> grammar when I found out that ANTLRFileStream reads all the tokens in before
> the parser even starts up -- which means the parser can't give the lexer any
> direction over token interpretation.
>
>
> Thoughts, suggestions, outrageous flames? Is there a "good" way to do
> this, or maybe is there a completely different approach I should take?
>
> Thanks!
> -Chris

List: http://www.antlr.org/mailman/listinfo/antlr-interest
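
For readers following the thread: both questions above come down to producing tokens on demand instead of all at once. The following is a minimal sketch of that arrangement in plain Python, not ANTLR code and not ANTLR's API. The names LazyLexer, force_string, and parse are hypothetical and invented for illustration. The whitespace-delimited token pattern is a simplification of the STRING rule, and only the two example keywords are handled. It shows a lexer that holds one input line at a time and a parser that sets a one-shot flag after 'VERSION', roughly in the spirit of the Flex start state described in the quoted message.

```python
import io
import re

# Simplifying assumption: a token is a lone ';' or a run of characters
# containing neither whitespace nor ';'.
_TOKEN = re.compile(r"\s*(;|[^\s;]+)")
# Mirrors the NUMBER rule from the quoted grammar.
_NUMBER = re.compile(r"[-+]?(?:\d+|\d*\.\d*)")
KEYWORDS = {"VERSION", "LIMIT"}


class LazyLexer:
    """Produces one token per call; only the current line is kept in memory."""

    def __init__(self, lines):
        self._lines = iter(lines)   # any iterable of lines, e.g. an open file
        self._line = ""
        self._pos = 0
        self.force_string = False   # set by the parser; applies to one token

    def next_token(self):
        while True:
            m = _TOKEN.match(self._line, self._pos)
            if m:
                self._pos = m.end()
                return self._classify(m.group(1))
            nxt = next(self._lines, None)   # current line exhausted
            if nxt is None:
                return ("EOF", "")
            self._line, self._pos = nxt, 0

    def _classify(self, text):
        forced, self.force_string = self.force_string, False
        if text == ";":
            return (";", text)
        if forced:
            return ("STRING", text)         # keyword/number checks skipped
        if text in KEYWORDS:
            return (text, text)
        if _NUMBER.fullmatch(text):
            return ("NUMBER", text)
        return ("STRING", text)


def parse(lexer):
    """Yields (keyword, argument) for  VERSION STRING ;  and  LIMIT NUMBER ;"""
    while True:
        kind, text = lexer.next_token()
        if kind == "EOF":
            return
        if kind == "VERSION":
            lexer.force_string = True       # the parser steers the lexer here
            expected = "STRING"
        elif kind == "LIMIT":
            expected = "NUMBER"
        else:
            raise SyntaxError(f"unexpected token {text!r}")
        arg_kind, arg = lexer.next_token()
        if arg_kind != expected:
            raise SyntaxError(f"expected {expected}, got {arg_kind} {arg!r}")
        if lexer.next_token()[0] != ";":
            raise SyntaxError("expected ';'")
        yield (kind, arg)


# Both problem inputs from the quoted message now parse:
sample = io.StringIO("VERSION 1.2 ;\nVERSION LIMIT ;\nLIMIT 42 ;\n")
for statement in parse(LazyLexer(sample)):
    print(statement)
# ('VERSION', '1.2')
# ('VERSION', 'LIMIT')
# ('LIMIT', '42')
```

The flag only works because the lexer never runs ahead of the parser. With any token lookahead buffer, the token after 'VERSION' might already have been classified before the flag is set, which is the same obstacle the quoted message describes. Passing an open file instead of the StringIO keeps memory use bounded by the longest line, since parse is a generator and nothing accumulates.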
