(Sorry about the duplicate mail; Gmail sent my message before it was ready.) :)

Hi Adrian,

Thanks for the quick response. Trying to unpack what you're saying: do you mean I should define a scanner (as described in section 6.3 of the manual) that tries the various possibilities for street names, in order from most preferred to least? So one might have

    main := |*
        streetWithSuffixAndDirection;
        streetWithDirection;
        streetWithSuffix;
        street;
    *|;

?

I was looking a little more at regular expressions, and it seems that Perl-compatible regexes have some special options that let you define how matches are supposed to occur. For example:

http://www.boost.org/doc/libs/1_40_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html

"*? Matches the previous atom zero or more times, while consuming as little input as possible."

This seems like exactly what I need (a quick test indicates it gives the desired behaviour). Would it not be possible for Ragel to do this sort of thing?

Will

2009/9/23 Adrian Thurston <[email protected]>:
> Hi William,
>
> I think what you need is a traditional lexer. See section 6.3 of the manual.
>
> -Adrian
>
> William Lachance wrote:
>> Hi,
>>
>> I'm trying to construct a parser for street addresses using Ragel.
>> That is to say, a machine that will take a free form address like
>> "5553 Barrington Street NW" and parse out the individual components
>> (street number, name, suffix, direction). Everything was going
>> swimmingly until I started to try to add support for street names with
>> multiple tokens in them (e.g. "Bella Vista Avenue NW")
>>
>> Right now my main machine looks like this:
>>
>> streetNumber = (digit+ >getStartStr %endNumber);
>> streetName = (alpha+ (space+ alpha+)*) >getStartStr %endName;
>> suffixFull = space+ suffix
>> dirFull = space+ direction
>> main := (streetNumber alpha? space+)? streetName suffixFull? dirFull?
>>
>> The suffix and dir expressions are really long and boring
>> concatenations like this:
>>
>> directionWest = ("w"i|"west"i) >getStartStr %endDirWest;
>>
>> Anyway, the problem with this simple regular expression is that it
>> doesn't give up on parsing the streetName when it begins parsing the
>> direction and suffix. So in the above example, it will correctly parse
>> "Bella Vista", but then overwrite it with "Avenue", and later "NW". I
>> thought that perhaps adding a few ":>>"'s (to stop the processing of
>> the streetname when suffixes and directions appear) would help:
>>
>> main := (streetNumber alpha? space+)? streetName :>> suffixFull? :>>
>> dirFull? 0;
>>
>> Unfortunately, that seems to have the side effect of terminating
>> parsing of the street name prematurely (bringing us back to square
>> one).
>>
>> It _seems_ like what I'm doing should be straightforward. Basically
>> the rule should be: "keep on parsing the street until you find a token
>> that unambiguously matches a suffix and/or direction; at that point,
>> stop, only keeping the previous tokens". Surely there's a way of
>> expressing that in Ragel?
>>
>
> _______________________________________________
> ragel-users mailing list
> [email protected]
> http://www.complang.org/mailman/listinfo/ragel-users
>

--
William Lachance
[email protected]

_______________________________________________
ragel-users mailing list
[email protected]
http://www.complang.org/mailman/listinfo/ragel-users
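
For illustration, here is a minimal sketch of the non-greedy ("*?") behaviour described in the message. It is not the test that was actually run: it uses Python's re module rather than Boost.Regex, the SUFFIX and DIRECTION lists are short hypothetical stand-ins for the long real ones, and parse_address is a made-up helper name.

```python
import re

# Hypothetical, abbreviated alternatives; the real lists are much longer.
SUFFIX = r"(?:street|st|avenue|ave|road|rd)"
DIRECTION = r"(?:north|south|east|west|nw|ne|sw|se|n|s|e|w)"

ADDRESS = re.compile(
    rf"""
    ^\s*
    (?:(?P<number>\d+)[a-z]?\s+)?        # optional street number plus optional letter
    (?P<name>[a-z]+(?:\s+[a-z]+)*?)      # lazy repetition, so the name takes as few tokens as possible
    (?:\s+(?P<suffix>{SUFFIX}))?         # optional suffix
    (?:\s+(?P<direction>{DIRECTION}))?   # optional direction
    \s*$
    """,
    re.IGNORECASE | re.VERBOSE,
)


def parse_address(text):
    """Return a dict of address components, or None if the text does not match."""
    match = ADDRESS.match(text)
    return match.groupdict() if match else None


print(parse_address("5553 Barrington Street NW"))
print(parse_address("Bella Vista Avenue NW"))
```

Because the name group is lazy, the engine only extends it by another token when the rest of the pattern cannot match, so "Bella Vista" is kept as the name while "Avenue" and "NW" land in the suffix and direction groups. This relies on backtracking, which is worth keeping in mind when comparing it with what a Ragel state machine can express.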
