Hello Lynn,

actually, that seems to be the wrong way to use re2c. What you want is a state machine driven by re2c input. That is, re2c becomes your lexer and something else becomes your parser. So you use re2c only to detect the tags (something like "<" [^<>]* ">") and let the parser take care of the actual tag detection, replacement, or removal. The parser would eventually need to keep a stack of the detected tags: an opening tag is a push and a matching closing tag is a pop. For the parser you may consider using lemon, or go with a simple stack integrated into your re2c code.

Also note that you will most likely have to modify the rule above to detect the actual tag name, that is, ignore anything after the first whitespace. Here you might need trailing context support. The reason is that HTML is not perfect XML. Alternatively, a rule like "<" [^<> \t\r\n]* would need to switch to a new scanner block that detects the end of the tag. Or your parser has to pick out the tag name itself.
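A minimal C sketch of that lexer/parser split, under stated assumptions: next_token is a hand-written stand-in for the scanner that re2c would generate from a rule like "<" [^<>]* ">", and strip_tags is a hypothetical stack-based parser that drops the tags. The names, the MAX_DEPTH and MAX_NAME limits, and the treatment of self-closing tags are illustrative choices, not part of re2c or lemon.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_DEPTH 32  /* hypothetical limit on tag nesting */
#define MAX_NAME  16  /* hypothetical limit on tag-name length */

enum token_kind { TOK_TEXT, TOK_OPEN, TOK_CLOSE, TOK_VOID, TOK_END };

struct token {
    enum token_kind kind;
    const char *start;    /* first character of the token */
    size_t len;           /* length of the whole token */
    char name[MAX_NAME];  /* lower-cased tag name, "" for text */
};

/* Lexer: stand-in for the re2c-generated scanner. Returns one token
 * and advances *cursor past it. A tag ends at the FIRST '>'. */
static struct token next_token(const char **cursor)
{
    struct token t = { TOK_END, *cursor, 0, "" };
    const char *p = *cursor;
    const char *q = p;

    if (*p == '\0')
        return t;

    if (*p == '<') {
        q = p + 1;
        while (*q != '\0' && *q != '<' && *q != '>')
            q++;
        if (*q == '>') {                      /* matched "<" [^<>]* ">" */
            const char *n = p + 1;
            size_t i = 0;
            t.kind = TOK_OPEN;
            if (*n == '/') { t.kind = TOK_CLOSE; n++; }
            else if (q[-1] == '/') t.kind = TOK_VOID;   /* e.g. <br/> */
            /* tag name: stop at whitespace, '/' or the closing '>' */
            while (n < q && !isspace((unsigned char)*n) && *n != '/'
                   && i + 1 < MAX_NAME)
                t.name[i++] = (char)tolower((unsigned char)*n++);
            t.name[i] = '\0';
            t.len = (size_t)(q - p) + 1;
            *cursor = q + 1;
            return t;
        }
        /* '<' without a closing '>': pass it through as one text char */
        t.kind = TOK_TEXT;
        t.len = 1;
        *cursor = p + 1;
        return t;
    }

    while (*q != '\0' && *q != '<')           /* text run up to next '<' */
        q++;
    t.kind = TOK_TEXT;
    t.len = (size_t)(q - p);
    *cursor = q;
    return t;
}

/* Parser: copies text to out, drops tags, and tracks nesting on a stack.
 * Returns the number of unmatched tags (0 if balanced), or -1 if out is
 * too small or the nesting is deeper than MAX_DEPTH. */
static int strip_tags(const char *in, char *out, size_t out_size)
{
    char stack[MAX_DEPTH][MAX_NAME];
    int depth = 0, unmatched = 0;
    size_t used = 0;
    const char *cursor = in;

    for (;;) {
        struct token t = next_token(&cursor);
        switch (t.kind) {
        case TOK_END:
            if (used >= out_size) return -1;
            out[used] = '\0';
            return unmatched + depth;         /* stray closes + unclosed opens */
        case TOK_TEXT:
            if (t.len >= out_size - used) return -1;
            memcpy(out + used, t.start, t.len);
            used += t.len;
            break;
        case TOK_OPEN:                        /* opening tag: push */
            if (depth == MAX_DEPTH) return -1;
            strcpy(stack[depth++], t.name);
            break;
        case TOK_CLOSE:                       /* matching closing tag: pop */
            if (depth > 0 && strcmp(stack[depth - 1], t.name) == 0)
                depth--;
            else
                unmatched++;
            break;
        case TOK_VOID:                        /* self-closing: nothing to track */
            break;
        }
    }
}
```

Because the tag rule excludes '<' and '>' from the tag body, the lexer stops at the end of each individual tag instead of running on to the last </b>; pairing is left entirely to the stack. To replace tags rather than remove them, the TOK_OPEN and TOK_CLOSE cases would write the substitute text to out instead of writing nothing.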
For lemon look here: http://www.hwaci.com/sw/lemon/index.html

best regards
marcus

Tuesday, February 20, 2007, 3:45:05 PM, you wrote:

> <alert comment="not that familiar with regex and rusty with re2" />
>
> I'm trying to write a scanner that does the equivalent of 'greedily'
> detecting html tag-pairs, including situations with several of the
> same tag-pair in the string. An example:
>
> normal-a <b>bold-b </b> normal-c <b>bold-d </b> normal-e
>
> I've tried a variety of combinations that are something like:
>
> /*!re2c
> "<b>".+?"</b>" { code goes here; }
> [\000-\377] { code goes here; }
> */
>
> This sort of works, but I haven't been able to figure out how to get
> it to be "greedy". With a "source string" like the previous, I want it
> to "accept" after "consuming" <b>bold-b </b> .... but the scanner keeps
> on going.
>
> When I step thru the generated code, I see:
> yyaccept = 1;
> when it has "consumed" <b>bold-b </b>, but it keeps going and also
> reaches:
> yyaccept = 1;
> after <b>bold-d </b>.
>
> I want it to stop/accept after <b>bold-b </b> so the length will be 14
> rather than 38.
>
> Can this be done? Am I doing something wrong or leaving something out?
>
> In the comments for the "strip comments" example, I saw information
> about "multiple scanner blocks" and also "trailing contexts". Do these
> apply?
>
> Is there sample code that demonstrates "best practices" for detecting
> and removing html tags? Seems like that would be a good use of re2c.
>
> Even better would be a sample that demonstrated "best practices" for
> using re2c to replace html tags with something else.
>
> Thanks

--
Best regards,
Marcus

_______________________________________________
Re2c-general mailing list
https://lists.sourceforge.net/lists/listinfo/re2c-general
