Re: [re2c-general] HowTo: 'greedy' regex accepts first of several matches?

Marcus Boerger Tue, 20 Feb 2007 08:13:46 -0800

Hello Lynn,

  actually that seems to be the wrong way to use re2c. What you want is
a state machine driven by re2c input. That is, re2c becomes your lexer
and something else becomes your parser. You would use re2c only to detect
the tags (something like "<" [^<>]* ">") and let the parser take care of
the actual tag-pair detection, replacement, and removal. The parser would
need to keep a stack of the detected tags: an opening tag is a push and a
matching closing tag is a pop. For the parser you may consider using
lemon, or go with a simple stack integrated into your re2c code. Also
note that you will most likely have to modify the rule above to detect
the actual tag name, that is, to ignore anything after the first
whitespace, and for that you might need trailing context support. The
reason is that HTML is not perfect XML. Alternatively, a rule such as
"<" [^<> \t\r\n]* could switch to a new scanner block that detects the
end of the tag, or your parser has to extract the tag name itself.
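
As a rough illustration of the push/pop idea, here is a minimal sketch in
plain C. It is not re2c output: match_tag() is a hand-written stand-in for
the rule "<" [^<>]* ">", and the names match_tag, tag_name, pair_length,
MAX_DEPTH and MAX_NAME are made up for this example. It assumes well-nested
tags and does not handle void or self-closing tags such as <br>.

```c
#include <stddef.h>
#include <string.h>

#define MAX_DEPTH 64   /* hypothetical limit on nesting depth */
#define MAX_NAME  32   /* hypothetical limit on tag name length */

/* Stand-in for the lexer rule  "<" [^<>]* ">" .
 * Returns the length of the tag starting at p, or 0 if p is not a tag. */
static size_t match_tag(const char *p)
{
    const char *q;
    if (*p != '<')
        return 0;
    for (q = p + 1; *q != '\0' && *q != '<' && *q != '>'; ++q)
        ;
    return (*q == '>') ? (size_t)(q - p) + 1 : 0;
}

/* Extract the tag name, ignoring anything after the first whitespace.
 * Sets *closing to 1 for a closing tag such as "</b>". */
static void tag_name(const char *tag, size_t len, char *name, int *closing)
{
    size_t i = 1, n = 0;
    *closing = 0;
    if (i < len && tag[i] == '/') {
        *closing = 1;
        ++i;
    }
    while (i + 1 < len && !strchr(" \t\r\n/", tag[i]) && n < MAX_NAME - 1)
        name[n++] = tag[i++];
    name[n] = '\0';
}

/* s must point at an opening tag. Returns the number of characters up to
 * and including the closing tag that matches it, or 0 if s is not at a
 * tag or the tags are mismatched or unbalanced. */
static size_t pair_length(const char *s)
{
    char stack[MAX_DEPTH][MAX_NAME];
    char name[MAX_NAME];
    int depth = 0, closing;
    const char *p = s;

    if (match_tag(s) == 0)
        return 0;
    while (*p != '\0') {
        size_t len = match_tag(p);
        if (len == 0) {            /* ordinary character: skip it */
            ++p;
            continue;
        }
        tag_name(p, len, name, &closing);
        if (!closing) {            /* opening tag: push */
            if (depth == MAX_DEPTH)
                return 0;
            strcpy(stack[depth++], name);
        } else {                   /* closing tag: must match top, then pop */
            if (depth == 0 || strcmp(stack[depth - 1], name) != 0)
                return 0;
            --depth;
        }
        p += len;
        if (depth == 0)            /* first pair is complete: stop here */
            return (size_t)(p - s);
    }
    return 0;
}
```

With the example string from the original question, pair_length() called
at the first <b> returns 14 rather than 38, because the stack is empty
again right after the first </b>.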


For lemon look here: http://www.hwaci.com/sw/lemon/index.html

best regards
marcus

Tuesday, February 20, 2007, 3:45:05 PM, you wrote:

> <alert comment="not that familiar with regex and rusty with re2c" />

> I'm trying to write a scanner that does the equivalent of 'greedily'
> detecting html tag-pairs, including situations with several of the 
> same tag-pair in the string. An example:
> normal-a <b>bold-b </b> normal-c <b>bold-d </b> normal-e

> I've tried a variety of combinations that are something like:
> /*!re2c
> "<b>".+?"</b>" { code goes here; }
> [\000-\377] { code goes here; }
> */

> This sort of works, but I haven't been able to figure out how to get
> it to be "greedy". With a "source string" like the previous, I want it
> to "accept" after "consuming" <b>bold-b </b> .... but the scanner keeps
> on going.

> When I step thru the generated code, I see:
> yyaccept = 1;
> when it has "consumed" <b>bold-b </b>, but it keeps going and also
> reaches:
> yyaccept = 1;
> after <b>bold-d </b>.

> I want it to stop/accept after <b>bold-b </b> so the length will be 14
> rather than 38.

> Can this be done? Am I doing something wrong or leaving something out?

> In the comments for the "strip comments" example, I saw information 
> about "multiple scanner blocks" and also "trailing contexts". Do these 
> apply?

> Is there sample code that demonstrates "best practices" for detecting 
> and removing html tags? Seems like that would be a good use of re2c. 
> Even better would be a sample that demonstrated "best practices" for 
> using re2c to replace html tags with something else.

> Thanks




> -------------------------------------------------------------------------
> Take Surveys. Earn Cash. Influence the Future of IT
> Join SourceForge.net's Techsay panel and you'll get the chance to share your
> opinions on IT & business topics through brief surveys-and earn cash
> http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> _______________________________________________
> Re2c-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/re2c-general





