[il-antlr-interest: 24185] [antlr-interest] Keywords vs. freeform text

Dukie Banderjee Fri, 12 Jun 2009 21:29:08 -0700

Hi, 
Hope this isn't too much of a newbie question.

I need to parse a format (EDI) which is basically delimited fields, but some 
fields must contain standardized code values whereas other fields can contain 
freeform text.


My question is about lexing versus parsing. Should I define a lexer token for 
each possible code, or should I just accept a freeform TEXT token and then 
check the matched text afterwards to determine whether it's a valid code?

I currently have a grammar which handles *some* of the more important codes by 
specifying lexer tokens. E.g.:

ST: 'ST' ;
BFR: 'BFR' ;
N1: 'N1' ;
REF: 'REF' ;
etc...

And a freeform TEXT token:
TEXT: ('a'..'z'|'A'..'Z'|'0'..'9'|' '|':'|'-'|','|'.')+ ;

Then I use a parser rule for those possible fields where *any* text is allowed, 
even a code:
fieldText    : TEXT | code ;

code    : ST
        | BFR
        | N1
        | REF
etc...

This seems to be working okay for now, but I foresee problems as I'm trying to 
expand the grammar to work with all the various codes defined by the EDI 
standards. For example, some of the codes contain solely numeric characters, 
such as '09' or '01' or '12'. Later, I want to add checking for freeform 
numeric fields, such as those which might contain quantities or arbitrary 
integers. I think it will start to get ugly if I try to specify lexer tokens 
and parser rules like this:

CODE_09: '09' ;

NUMERIC: ('0'..'9')+ ;

numericField: NUMERIC | numericCode ;

numericCode: CODE_09 | CODE_01 ... etc.

The core issue is that I need to *sometimes* treat certain fixed sequences of 
characters (e.g. 'ST' or '09') as special, and sometimes as merely freeform 
text or numeric values.

I'm fairly new to ANTLR (and parsing/lexing), so I'm not really sure what's a 
good way to resolve this. Any tips/pointers?

Example input:
ISA*00**00**01*812520286  // Here '01' is a special code which determines the
                          // type/format of the following field
SE*01*1052                // Here '01' is simply a numeric value which should be
                          // interpreted as an integer.
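
To make the second option concrete (accept freeform text everywhere, then decide 
afterwards whether a field is a code), here is a rough sketch in Python rather 
than ANTLR. It only illustrates the idea that the meaning of '01' depends on the 
segment and field position. The SCHEMA table, the interpret function and the 
code sets in it are hypothetical and are not taken from the EDI standards.

```python
import re

# Hypothetical schema: (segment id, field position) -> (kind, allowed codes).
# Positions and code sets are invented for illustration only.
SCHEMA = {
    ("ISA", 5): ("code", {"01", "09", "12"}),
    ("SE", 1): ("numeric", None),
}

def interpret(line, sep="*"):
    """Split one segment into fields and classify each field by its position."""
    fields = line.split(sep)
    segment = fields[0]
    result = []
    for pos, raw in enumerate(fields[1:], start=1):
        # Any position not listed in the schema is treated as freeform text.
        kind, allowed = SCHEMA.get((segment, pos), ("text", None))
        if kind == "code":
            if raw not in allowed:
                raise ValueError(f"{segment}{pos:02d}: unknown code {raw!r}")
            result.append(("code", raw))
        elif kind == "numeric":
            if not re.fullmatch(r"[0-9]+", raw):
                raise ValueError(f"{segment}{pos:02d}: not numeric: {raw!r}")
            result.append(("numeric", int(raw)))
        else:
            result.append(("text", raw))
    return result

isa = interpret("ISA*00**00**01*812520286")
se = interpret("SE*01*1052")
```

With this table, the fifth field of the ISA line comes back as a code and the 
first field of the SE line comes back as an integer, even though both contain 
the same two characters.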

As a side topic, how do I write a lexer which properly handles both:
a) freeform alphanumeric (and spaces) input such as 
('a'..'z'|'A'..'Z'|'0'..'9'|' '|':'|'-'|','|'.')+
b) freeform numeric input such as ('0'..'9')+
Is this doomed to be ambiguous? Should it be handled by the parser? Is there a 
way to handle it in the lexer?
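
One possible way around the overlap, sketched in Python rather than ANTLR, is 
to match a single field pattern and classify the matched text afterwards. An 
all-digit field is labelled NUMERIC and everything else is labelled TEXT, so a 
later stage can still accept NUMERIC wherever TEXT is allowed. The names 
TOKEN_RE and tokenize are made up for this sketch, and it assumes '*' is the 
only field separator.

```python
import re

# One pattern for separators, one for any run of freeform field characters.
TOKEN_RE = re.compile(r"(?P<SEP>\*)|(?P<FIELD>[A-Za-z0-9 :\-,.]+)")
DIGITS_RE = re.compile(r"[0-9]+")

def tokenize(line):
    """Return (kind, text) pairs; kind is SEP, NUMERIC or TEXT."""
    tokens = []
    pos = 0
    while pos < len(line):
        m = TOKEN_RE.match(line, pos)
        if m is None:
            raise ValueError(f"unexpected character {line[pos]!r} at {pos}")
        if m.lastgroup == "SEP":
            tokens.append(("SEP", m.group()))
        else:
            text = m.group("FIELD")
            # Classify after matching: the whole field decides the kind.
            kind = "NUMERIC" if DIGITS_RE.fullmatch(text) else "TEXT"
            tokens.append((kind, text))
        pos = m.end()
    return tokens

tokens = tokenize("SE*01*1052")
```

Because the classification looks at the whole field, a field like '12a' is never 
split into a NUMERIC piece followed by a TEXT piece.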

Thanks

Rob
