The following requires more than just a single gsub but it does solve the problem. Modify to suit.
The first gsub places <...> around the first occurrence of any duplicated suffixes. We use the (?=...) zero width regexp to circumvent the nesting problem. Then we use strapply from the gsubfn package to extract the suffixes so marked and paste them together to pass to a second gsub which locates them in the original string appending an <r> to each. Uncomment the commented pat if you only want to match 2+ character suffixes.

library(gsubfn)

# places <...> around first occurrences of repeated suffixes
text <- "And this is the second sentence"
pat <- "(\\w+)(?=\\b.+\\1\\b)"
# pat <- "(\\w\\w+)(?=\\b.+\\1\\b)"
out <- gsub(pat, "\\<\\1\\>", text, perl = TRUE)

suff <- strapply(out, "<([^>]+)>", function(x,y)y)[[1]]
gsub(paste("(", paste(suff, collapse = "|"), ")\\b", sep = ""), "\\1<r>", text)

On 7/22/06, Stefan Th. Gries <[EMAIL PROTECTED]> wrote:
> Dear all
>
> I use R for Windows 2.3.1 on a fully updated Windows XP Home SP2 machine and
> I have two related regular expression problems.
>
> platform i386-pc-mingw32
> arch i386
> os mingw32
> system i386, mingw32
> status
> major 2
> minor 3.1
> year 2006
> month 06
> day 01
> svn rev 38247
> language R
> version.string Version 2.3.1 (2006-06-01)
>
>
> I would like to find cases of words in elements of character vectors that end
> in the same character sequences; if I find such cases, I want to add <r> to
> both potentially rhyming sequences. An example:
>
> INPUT:This is my dog.
> DESIRED OUTPUT: This<r> is<r> my dog.
>
> I found a solution for cases where the potentially rhyming words are adjacent:
>
> text<-"This is my dog."
> gsub("(\\w+?)(\\W\\w+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)
>
> However, with another text vector, I came across two problems I cannot seem
> to solve and for which I would love to get some input.
>
> (i) While I know what to do for non-adjacent words in general
>
> gsub("(\\w+?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", "This not is my dog",
> perl=TRUE) # I know this is not proper English ;-)
>
> this runs into problems with overlapping matches:
>
> text<-"And this is the second sentence"
> gsub("(\\w+?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)
> [1] "And<r> this is the second<r> sentence"
>
> It finds the "nd" match, but since the "is" match is within the two "nd"'s,
> it doesn't get it. Any ideas on how to get all pairwise matches?
>
> (ii) How would one tell R to match only when there are 2+ characters
> matching? If the above expression is applied to another character string
>
> text<-"this is an example sentence."
> gsub("(\\w+?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)
>
> it also matches the "e"'s at the end of example and sentence. It's not
> possible to get rid of that by specifying a range such as {2,}
>
> text<-"this is an example sentence."
> gsub("(\\w{2,}?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)
>
> because, as I understand it, this requires the 2+ cases of \\w to be
> identical characters:
>
> text<-"doo yoo see mee?"
> gsub("(\\w{2,}?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)
>
> Again, any ideas?
>
> I'd really appreciate any snippets of codes, pointers, etc.
> Thanks so much,
> STG
> --
> Stefan Th. Gries
> -----------------------------------------------
> University of California, Santa Barbara
> http://www.linguistics.ucsb.edu/faculty/stgries
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
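For anyone without gsubfn installed, the two-pass idea from the reply at the top of this message can also be sketched in base R. This is an illustrative variant under stated assumptions, not the code posted in the thread: mark_rhymes and its min_len argument are hypothetical names, and regmatches/gregexpr stand in for the strapply step (they pull out the lookahead matches directly, so the intermediate <...> marking is skipped).

```r
# Hypothetical helper (not from the thread): base R sketch of the two-pass approach.
mark_rhymes <- function(text, min_len = 1) {
  # Pass 1: collect word-final sequences that recur at the end of a later word.
  # Same lookahead pattern as in the reply, with the minimum length made a parameter.
  pat <- sprintf("(\\w{%d,})(?=\\b.+\\1\\b)", as.integer(min_len))
  suff <- unique(regmatches(text, gregexpr(pat, text, perl = TRUE))[[1]])
  if (length(suff) == 0) return(text)  # nothing repeats, leave text unchanged

  # Pass 2: append <r> to every word-final occurrence of those sequences.
  alt <- paste(suff, collapse = "|")
  gsub(paste0("(", alt, ")\\b"), "\\1<r>", text, perl = TRUE)
}

mark_rhymes("This is my dog.")
# [1] "This<r> is<r> my dog."
mark_rhymes("And this is the second sentence", min_len = 2)
# [1] "And<r> this<r> is<r> the second<r> sentence"
```

Setting min_len = 2 plays the same role as uncommenting the second pat in the reply. With the default of 1, single-character endings such as the "e" in "the" and "sentence" are marked as well.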