Crud, there's an error in my example... Sorry about that, it'd blow up at the 'B'.
One more time: TOKEN_1: 'F'; TOKEN_2: 'FOO'; TOKEN_3: 'FOOBAR'; TOKEN_4: 'A'; TOKEN_5: 'B'; The input of: 'FOOBAF' 'F' tokens 1, 2, 3 are eligible, 4, 5 is removed. 'O' tokens 2, 3 are eligible, 1 is removed. 'O' tokens 2, 3 are eligible. 'B' token 3 is eligible, 2 is removed. 'A' token 3 is eligible. 'F' token 3 is not eligible, exception thrown. Rather then generating: TOKEN_2, TOKEN_5, TOKEN_4, TOKEN_1 (which technically would match), unlike my previous example. Kirby On Mon, Dec 14, 2009 at 12:46 PM, Kirby Bohling <[email protected]> wrote: > On Mon, Dec 14, 2009 at 12:18 PM, Jean-Claude Durand > <[email protected]> wrote: >> My lexical grammar (I use antlr v3.2): >> lexer grammar Lex; >> options >> { language=Java; } >> WS: ( ' ' | '\t' | '\n' )+ { $channel=HIDDEN; } ; >> FIN : '-FIN-' ; >> Moins : '-' ; >> // Identifiers: >> Idf : ('A'..'Z')+ ; >> I want to enumerate the tokens for the following example (Main.java is in >> the archive): >> VLEG-XLEG-FCINFZU >> And the output is: >> ~/Soft/Antlr/LexJava: java Main test >> --> [...@-1,0:3='VLEG',<7>,1:0] >> --> [...@-1,4:4='-',<6>,1:4] >> --> [...@-1,5:8='XLEG',<7>,1:5] >> line 1:11 mismatched character 'C' expecting 'I' >> --> [...@-1,12:16='INFZU',<7>,1:12] >> --> [...@-1,17:36=' ',<4>,channel=99,1:17] >> ~/Soft/Antlr/LexJava: >> The lexer is looking for the keyword -FIN- and not for minus sign followed >> by an identifier (which begins with an F). >> > > > This is one of the weirdism's (IMHO, the author of the tool, finds > this totally natural, and he's got 20 odd years of parsing > experience), of the lexer is that there is no backtracking. Hopefully > I've said that correctly, so lets say you have rules like: > > TOKEN_1: 'F'; > TOKEN_2: 'FOO'; > TOKEN_3: 'FOOBAR'; > TOKEN_4: 'AR'; > > Now, if given the input of: 'FOOBFAR', I would expect to get: > TOKEN_2, TOKEN_1, TOKEN_4, however, you'll get an error (I forget which one). > > As it moves along parsing the input, it keeps track of which tokens it > could be, and if the next character could be part of any token, all > the tokens it can't be are removed from consideration. > > So my understanding of how my example works is as follows: > When the 'F' is consumed rules 1, 2, 3 are eligible, 4 has been removed. > When the 'O' is consumed rules 2, 3 are eligible, rule 1 has been removed. > When the 'O' is consumed rules 2, 3 are eligible. > When the 'B' is consumed rules 3 is eligible, rule 2 is removed (at > which point, you can never return TOKEN_2), which would be my expected > behavior. > When the 'F' is consumed no rules are eligible, throw some type of > ANTLR Exception. > > I really wish it'd fall back and give me the last token that did > actually work. From what I can gather, there is some performance > issue with doing that (I've never understood why). Instead, you have > to manually implement all of the back tracking yourself. If you > wanted to do something like that, you'd do something like: > > // NOTE: This is inefficient, in reality, you'd declare TOKEN_3 in the > tokens declaration, and after matching 'FOO', see if the 'BAR' > matches, if it does, return TOKEN_3, if it doesn't return TOKEN_2 and > don't consume any more input. > TOKEN_1: 'F'; > TOKEN_2: 'FOO'; > // The lookahead syntax is from memory, it might be wrong... > // I might also have chosen the wrong one between semantic and > syntactic predicates, I defer to more experience folks again... > TOKEN_3: ('FOOBAR')=> 'FOOBAR'; > TOKEN_4: 'AR'; > > The canonical example of this technique: > http://www.antlr.org/wiki/display/ANTLR3/Lexer+grammar+for+floating+point,+dot,+range,+time+specs > > That's fairly complex, if you don't understand what it's doing, I'd > recommend picking up the ANTLR book. > > So in your case, it sees the '-F', at which point, the only eligible > token is '-FINS-', if the input isn't that (which in your case it > isn't), an exception will be thrown on the first character that does > not match the lexer expectation. > > I have no idea what the best approach for resolving this is. Others > on the list have a much deeper understanding of language design then I > do. I'll defer to them on the best approach for this problem. My > e-mail is just trying to explain what I believe is going on (similar > problems I had confused, and frustrated me for a long time when I > started with ANTLR). > > Kirby > >> >> Thanks a lot for your help. >> Jean-Claude Durand >> LIG, équipe GETALP >> 385, rue de la Bibliothèque >> BP 53 >> 38041 Grenoble cedex 9 >> [email protected] >> tél: +33 (0)4 76 51 43 81 >> fax: +33 (0)4 76 63 56 86 >> >> >> >> >> >> List: http://www.antlr.org/mailman/listinfo/antlr-interest >> Unsubscribe: >> http://www.antlr.org/mailman/options/antlr-interest/your-email-address >> >> > List: http://www.antlr.org/mailman/listinfo/antlr-interest Unsubscribe: http://www.antlr.org/mailman/options/antlr-interest/your-email-address -- You received this message because you are subscribed to the Google Groups "il-antlr-interest" group. To post to this group, send email to [email protected]. To unsubscribe from this group, send email to [email protected]. For more options, visit this group at http://groups.google.com/group/il-antlr-interest?hl=en.
