John,

Thank you very much for this, and even more for the pointer to the markmail version of this list's archive -- much more effective for search purposes!!

Sheila
________________________________________
From: John B. Brodie [[email protected]]
Sent: Friday, November 19, 2010 8:56 PM
To: Sheila M. Morrissey
Cc: '[email protected]'
Subject: Re: [antlr-interest] Missing something basic about lexer tokens

Greetings!

On Fri, 2010-11-19 at 18:58 -0500, Sheila M. Morrissey wrote:
> Hello,
>
> I am working on a recognizer that processes a text file, each line of which
> starts with one of short list of about 20 characters (mostly either upper
> case or lower case letters, a few special chars), immediately followed by a
> "name" (chars or dash), a space or 2, and then various space-delimited
> stretches of text comprised of arbitrarily any ASCII character Except
> newline, followed by newline.
>
> The first letter is significant - it indicates what sort of "command" each
> line is.
>
> Here's a simplified version of the grammar, with just one of these "commands"
> specified:
>
> grammar ElementAttributes;
>
> options {
>   language = Java;
> }
> @parser::header {}
> @lexer::header {}
>
> elementAttributes : elementAttributeCommand+ EOF;
>
> /**
> e.g.
> Aname IMPLIED
> */
>
> elementAttributeCommand : ACMD NAME SPACE+ ATTRTYPE NEWLINE;
>
> ATTRTYPE : ('IMPLIED'|'CDATA'|'NOTATION'|'ENTITY'|'TOKEN'|'ID'|'DATA');
> ACMD : 'A';
> NEWLINE: '\r'? '\n';
> SPACE: ' ';
> NAME : (NAMESTARTCHAR NAMECHAR*);
>
> fragment LOWERCASELETTER : ('a'..'z');
> fragment UPPERCASELETTER : ('A'..'Z');
> fragment DIGIT : ('0'..'9');
> fragment DASH : ('-');
> fragment NAMESTARTCHAR : (LOWERCASELETTER | UPPERCASELETTER);
> fragment NAMECHAR : (NAMESTARTCHAR | DIGIT | DASH);
>
>
> If run on a file consisting only of the line (terminated with NEWLINE)
> Aname IMPLIED
>
> I get the following error:
> line 1:0 required (...)+ loop did not match anything at input 'Aname'
>
> How should I be declaring the lexer rules so that 'A' at start of line is
> recognized as a command token, and yet still make it possible for the "NAME"
> immediately following it to be unambiguously recognized?
>

Please recall 3 facts about current ANTLR lexers:

1) they recognize tokens independent from any parsing context; and

2) they do not back-track (once committed to recognizing a prefix of a token the rest of the input must match that token); and

3) they are greedy and happily recognize the longest valid string possible.

(I suspect you already know the above facts, but I repeat them in case someone in the future searches the mailing-list archive at markmail.antlr.org and finds this message without that knowledge)

And so, as you have observed, when the input word "Aname" is seen by your lexer it will produce the token NAME because that single token greedily matches all of the characters in that input word.

And so your requirement "at the start of the line" must be, somehow, encoded into your lexer rule(s) for command(s) like ACMD.

I believe you can read a discussion of this issue by searching the archives at markmail.antlr.org for messages about special tokens at the beginning of a line.
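To make the greedy behaviour described above concrete, here is a minimal hand-written Java sketch. It is not ANTLR-generated code and does not claim to show how ANTLR builds its lexers; it only mimics longest-match scanning on a single line. The names GreedyLexerSketch, tokenize and columnAware are hypothetical, invented for this illustration. With columnAware set to false the scanner behaves like the grammar in question and swallows "Aname" as one NAME; with it set to true the start-of-line requirement is encoded as an explicit column check, so 'A' in column 0 becomes ACMD.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: a tiny hand-rolled scanner, not ANTLR output.
class GreedyLexerSketch {

    // Same keyword list as the ATTRTYPE rule in the quoted grammar.
    static final List<String> ATTR_TYPES = Arrays.asList(
            "IMPLIED", "CDATA", "NOTATION", "ENTITY", "TOKEN", "ID", "DATA");

    static boolean isNameStart(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    static boolean isNameChar(char c) {
        return isNameStart(c) || (c >= '0' && c <= '9') || c == '-';
    }

    // Tokenizes one line (without its trailing newline).
    // columnAware == false: purely greedy, no notion of "start of line".
    // columnAware == true : an 'A' in column 0 is emitted as ACMD (assumed fix).
    static List<String> tokenize(String line, boolean columnAware) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < line.length()) {
            char c = line.charAt(i);
            if (columnAware && i == 0 && c == 'A') {
                tokens.add("ACMD(A)");
                i++;
            } else if (isNameStart(c)) {
                int start = i;
                // Greedy: keep consuming while the character can extend the token.
                while (i < line.length() && isNameChar(line.charAt(i))) {
                    i++;
                }
                String text = line.substring(start, i);
                String type = ATTR_TYPES.contains(text) ? "ATTRTYPE" : "NAME";
                tokens.add(type + "(" + text + ")");
            } else if (c == ' ') {
                tokens.add("SPACE");
                i++;
            } else {
                throw new IllegalArgumentException(
                        "unexpected character '" + c + "' at column " + i);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [NAME(Aname), SPACE, ATTRTYPE(IMPLIED)]
        System.out.println(tokenize("Aname IMPLIED", false));
        // prints [ACMD(A), NAME(name), SPACE, ATTRTYPE(IMPLIED)]
        System.out.println(tokenize("Aname IMPLIED", true));
    }
}
```

The first token list contains no ACMD at all, which is why the parser rule elementAttributeCommand has nothing to match at 'Aname'. The second list is the shape the parser rule expects (minus the NEWLINE, which this sketch leaves out).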
I seem to remember (I haven't reviewed the archives) that the discussion boils down to 3 possibilities:

1) add a predicate(s) to test the start character index of the token to ensure that it is at the beginning of a line

2) use a rule of the form

   ACMD : NEWLINE 'A' ;

   which works for the second and subsequent lines of input, but requires creating a special sub-class of the input reader that always delivers a NEWLINE as the very first character and then delivers the characters of the actual input as the second and subsequent characters. And then of course your parser rules should not insist upon a NEWLINE at the end of a command (because that NEWLINE is part of the verb that starts the next command).

3) use a Lex-based lexer rather than an ANTLR-based lexer. Search the archives for jLex. Lex-based lexers are more oriented around regular expressions -- so start-of-line and end-of-line are more easily detected/used.

I also believe that Dr. Parr is looking at version 3 lexer issues such as this one and is trying to improve things for version 4. Search the mailing list archives for Dr. Parr's posts regarding version 4.

(As an aside, I think most of my mailing list search suggestions will actually result in pointers to pages in the wiki -- I am too lazy to actually give you those links directly, sorry)

Hope this helps...
   -jbb

List: http://www.antlr.org/mailman/listinfo/antlr-interest
