My goal is to remove all instances of the word "the" (and any surrounding
whitespace) from a string.  The code I started with is:

    pcrecpp::RE_Options options;
    options.set_utf8(true).set_caseless(true);
    pcrecpp::RE regex("(^|\\s+)The($|\\s+)",options);

    regex.GlobalReplace("",&some_string);

This works for most strings, but not for "The the".  It took me a bit to
figure out why, but I think I at least understand that much.  The first
match consumes "The " (start of string, "The", and the trailing space),
which leaves "the".  Unfortunately that no longer matches the regex: "the"
is not at the start of the original string, and the whitespace that
preceded it was already consumed by the first match.

I could remove the "($|\\s+)" from the regex, but that makes other things
fail (e.g. "The foo the" becomes " foo" instead of "foo").  Other
modifications I've come up with cause other failures too.

My tests pass if I call GlobalReplace in a loop, like this:

    // GlobalReplace returns the number of replacements it made
    int num_replacements;
    do {
        num_replacements = regex.GlobalReplace("",&some_string);
    } while (num_replacements > 0);

but I'm curious if this is a normal/good/optimal thing to do or if there's a
smarter regex to use that does everything in one call (maybe GlobalReplace
or I suppose another function).
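
For reference, here is a rough sketch of the kind of single-pass approach I
have in mind.  It uses std::regex rather than pcrecpp so that it is
self-contained, and it swaps the "(^|\\s+)" and "($|\\s+)" groups for
zero-width word boundaries (\b) with optional whitespace on either side, so
that no match depends on whitespace an earlier match already consumed.  The
helper name strip_the is made up for illustration, and I am assuming it is
acceptable for \b to also treat punctuation as a boundary (so "the-end"
would become "-end"), which the original pattern does not do.

```cpp
#include <regex>
#include <string>

// Hypothetical helper, not part of pcrecpp: removes every standalone "the"
// (case-insensitive) together with any whitespace directly around it.
std::string strip_the(const std::string& input) {
    // \b is zero-width, so the boundary check itself consumes nothing;
    // the surrounding \s* is optional and may match the empty string.
    static const std::regex pattern(R"(\s*\bthe\b\s*)",
                                    std::regex::ECMAScript | std::regex::icase);
    return std::regex_replace(input, pattern, "");
}
```

With this sketch, "The the" should come out empty and "The foo the" should
come out as "foo" in one call.  If only whitespace should count as a
delimiter, PCRE lookarounds might express the same idea, something like
\s*(?<!\S)the(?!\S)\s* (untested with pcrecpp).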

Thanks much for your help.

-DB


-- 
## List details at http://lists.exim.org/mailman/listinfo/pcre-dev 
