Hi,
Hope this isn't too much of a newbie question.
I need to parse a format (EDI) which is basically delimited fields, but some
fields must contain standardized code values whereas other fields can contain
freeform text.
My question is about lexing and/or parsing. Should I define a lexer token for
each possible code, or should I just accept a freeform TEXT token and validate
the text afterwards to determine whether it's a valid code?
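To make the second option concrete, here is a rough sketch in Python (not ANTLR) of what I mean by accepting freeform text first and checking codes afterwards. The names KNOWN_SEGMENT_IDS, split_segment and is_known_code are made up for illustration, and the code set is only the handful of identifiers from my grammar, not the full EDI list.

```python
# Hypothetical subset of codes; the EDI standards define many more.
KNOWN_SEGMENT_IDS = {"ST", "BFR", "N1", "REF"}


def split_segment(line, delimiter="*"):
    """Split one segment into fields; every field is plain text at this stage."""
    return line.split(delimiter)


def is_known_code(text, allowed=KNOWN_SEGMENT_IDS):
    """Decide after the fact whether a piece of freeform text is a valid code."""
    return text in allowed


fields = split_segment("REF*ST*some freeform text")
flags = [is_known_code(field) for field in fields]
```

Here 'ST' in the second field is just text until something asks whether it is a code, which is the behaviour I'm after.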
I currently have a grammar which handles *some* of the more important codes by
specifying lexer tokens. E.g.:
ST: 'ST' ;
BFR: 'BFR' ;
N1: 'N1' ;
REF: 'REF' ;
etc...
And a freeform TEXT token:
TEXT: ('a'..'z'|'A'..'Z'|'0'..'9'|' '|':'|'-'|','|'.')+ ;
Then I use a parser rule for those possible fields where *any* text is allowed,
even a code:
fieldText : TEXT | code ;
code : ST
| BFR
| N1
| REF
etc...
This seems to be working okay for now, but I foresee problems as I'm trying to
expand the grammar to work with all the various codes defined by the EDI
standards. For example, some of the codes contain solely numeric characters,
such as '09' or '01' or '12'. Later, I want to add checking for freeform
numeric fields, such as those which might contain quantities or arbitrary
integers. I think it will start to get ugly if I try to specify lexer tokens
and parser rules like this:
CODE_09: '09' ;
NUMERIC: ('0'..'9')+ ;
numericField: NUMERIC | numericCode ;
numericCode: CODE_09 | CODE_01 ... etc.
The core issue is that I need to *sometimes* treat certain fixed sequences of
characters (e.g. 'ST' or '09') as special, and sometimes as merely freeform
text or numeric values.
I'm fairly new to ANTLR (and parsing/lexing), so I'm not really sure what's a
good way to resolve this. Any tips/pointers?
Example input:
ISA*00**00**01*812520286
// Here '01' is a special code which determines the type/format of the
// following field.
SE*01*1052
// Here '01' is simply a numeric value which should be interpreted as an
// integer.
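In other words, the meaning of '01' depends on which segment and field position it appears in. A small Python sketch of that idea, assuming a hypothetical lookup table (FIELD_KINDS) keyed by segment name and field index, and an illustrative code set (ID_QUALIFIERS) that is not taken from the standard:

```python
# Illustrative code values only; not the real list from the EDI standard.
ID_QUALIFIERS = {"01", "09", "12"}

# Hypothetical schema: (segment id, field index) -> how to interpret the field.
FIELD_KINDS = {
    ("ISA", 5): "code",
    ("SE", 1): "integer",
}


def interpret(segment_line, delimiter="*"):
    """Return (kind, value) pairs for every field after the segment id."""
    fields = segment_line.split(delimiter)
    segment_id = fields[0]
    result = []
    for index, text in enumerate(fields[1:], start=1):
        kind = FIELD_KINDS.get((segment_id, index), "text")
        if kind == "code":
            if text not in ID_QUALIFIERS:
                raise ValueError(f"unknown code {text!r} in {segment_id} field {index}")
            result.append(("code", text))
        elif kind == "integer":
            result.append(("integer", int(text)))
        else:
            result.append(("text", text))
    return result


isa_fields = interpret("ISA*00**00**01*812520286")
se_fields = interpret("SE*01*1052")
```

The same characters '01' come back as ("code", "01") in the ISA segment and as ("integer", 1) in the SE segment.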
As a side topic, how do I write a lexer which properly handles both:
a) freeform alphanumeric (and spaces) input such as
('a'..'z'|'A'..'Z'|'0'..'9'|' '|':'|'-'|','|'.')+
b) freeform numeric input such as ('0'..'9')+
Is this doomed to be ambiguous? Should it be handled by the parser? Is there a
way to handle it in the lexer?
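To show what I mean by ambiguous, here is a toy tokenizer in Python (not ANTLR). It assumes a "longest match wins, ties go to the rule listed first" policy. That policy and the names TOKEN_RULES and next_token are my own assumptions for illustration; I don't know whether ANTLR resolves the overlap the same way.

```python
import re

# Rule order matters only when two rules match the same number of characters.
TOKEN_RULES = [
    ("NUMERIC", re.compile(r"[0-9]+")),
    ("TEXT", re.compile(r"[a-zA-Z0-9 :\-,.]+")),
]


def next_token(text, pos=0):
    """Return (rule name, matched text) for the longest match at pos, or None."""
    best = None
    for name, pattern in TOKEN_RULES:
        match = pattern.match(text, pos)
        # Strictly longer matches replace the current best; ties keep the earlier rule.
        if match and (best is None or match.end() > best[2]):
            best = (name, match.group(), match.end())
    return None if best is None else best[:2]


numeric_example = next_token("1052")
text_example = next_token("12 Main St.")
```

Under that assumed policy '1052' comes out as NUMERIC and '12 Main St.' as TEXT, so an all-digit field can never arrive as TEXT, and the parser would have to accept NUMERIC wherever freeform text is allowed.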
Thanks
Rob