Thanks David. After struggling with this bug for a day, I think I'm permanently dain bramaged.
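For the archive, here is a minimal sketch of the workaround in R, using the two test strings quoted below. The `[0-9]{5}` calls and their results are the ones reported in this thread. `extract_first()` is a hypothetical helper (not from anyone's post) that tidies up the original regexpr()/substr() approach by searching only once. The `perl = TRUE` line is an assumption: it switches sub() to the PCRE engine, and nobody in the thread tested whether that avoids the `\\d` problem on the affected Mac builds.

```r
# Test strings copied from the thread.
test  <- "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
test2 <- "aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"

# Workaround reported in the thread: spell the digit class out as [0-9].
sub(".*([0-9]{5}).*", "\\1", test)    # [1] "88958"
sub(".*([0-9]{5}).*", "\\1", test2)   # [1] "12345"

# Assumption (untested on the affected builds): use the PCRE engine instead.
# (?s) lets "." match the "\n" inside `test`, which PCRE does not do by default.
sub("(?s).*(\\d{5}).*", "\\1", test, perl = TRUE)

# Hypothetical helper: first match of `pattern` in each element of `x`,
# or NA where there is no match. regexpr() is called once and reused.
extract_first <- function(pattern, x) {
  m     <- regexpr(pattern, x)
  start <- as.integer(m)                 # plain integers; -1 marks "no match"
  len   <- attr(m, "match.length")
  out   <- substring(x, start, start + len - 1L)
  out[start == -1L] <- NA_character_
  out
}

extract_first("[0-9]{5}", c(test, test2, "no digits here"))
# [1] "88958" "12345" NA
```

The helper avoids the leading and trailing `.*` altogether, so it does not depend on how "." treats the newline in `test`.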
On Thu, May 6, 2010 at 3:54 AM, David Winsemius <dwinsem...@comcast.net> wrote:

>
> On May 6, 2010, at 2:21 AM, steven mosher wrote:
>
>> see below,
>>
>> using a regex in sub() fails if the pattern is //d{5} and succeeds
>> if the pattern [0-9]{5} is used.. see the test cases below.
>>
>> issue was not on windows machine and david and I had it on MAC.
>>
>
> Except we both were using \\d rather than //d.
>
> I believe that Steve is using R 2.11.0 but I am still using R 2.10.1 (but
> with the release of an Hmisc upgrade I will convert soon.)
>
> --
> David.
>
> > sessionInfo()
> R version 2.10.1 RC (2009-12-09 r50695)
> x86_64-apple-darwin9.8.0
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] tcltk stats graphics grDevices utils datasets methods
> base
>
> other attached packages:
> [1] gsubfn_0.5-2 proto_0.3-8 zoo_1.6-3 SASxport_1.2.3
> lattice_0.18-3
>
> loaded via a namespace (and not attached):
> [1] chron_2.3-35 grid_2.10.1 tools_2.10.1
>
>> r11
>>
>> mac os 10.5
>>
>>
>> ---------- Forwarded message ----------
>> From: steven mosher <mosherste...@gmail.com>
>> Date: Wed, May 5, 2010 at 3:25 PM
>> Subject: Re: [R] extracting a matched string using regexpr
>> To: David Winsemius <dwinsem...@comcast.net>
>> Cc: Gabor Grothendieck <ggrothendi...@gmail.com>, r-help <r-h...@r-project.org>
>>
>>
>> with a fresh restart
>>
>> test<-"</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>> > test
>> [1] "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>> > sub(".*(\\d{5}).*", "\\1", test)
>> [1] "</th>"
>> > sub(".*([0-9]{5}).*", "\\1", test)
>> [1] "88958"
>> > test2<-"aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"
>> > sub(".*(\\d{5}).*", "\\1", test2)
>> [1] "WWWWW"
>> > sub(".*(\\d{5}).*", "\\1", test2)
>> [1] "WWWWW"
>> > sub(".*([0-9]{5}).*", "\\1", test2)
>> [1] "12345"
>>
>>
>> Steve.
>>
>>
>> On Wed, May 5, 2010 at 3:20 PM, David Winsemius <dwinsem...@comcast.net> wrote:
>>
>>> On May 5, 2010, at 5:35 PM, Gabor Grothendieck wrote:
>>>
>>>> Here are two ways to extract 5 digits.
>>>>
>>>> In the first one \\1 refers to the portion matched between the
>>>> parentheses in the regular expression.
>>>>
>>>> In the second one strapply is like apply where the object to be worked
>>>> on is the first argument (array for apply, string for strapply) the
>>>> second modifies it (which dimension for apply, regular expression for
>>>> strapply) and the last is a function which acts on each value
>>>> (typically each row or column for apply and each match for strapply).
>>>> In this case we use c as our function to just return all the results.
>>>> They are returned in a list with one component per string but here
>>>> test is just a single string so we get a list one long and we ask for
>>>> the contents of the first component using [[1]].
>>>>
>>>> # 1 - sub
>>>> sub(".*(\\d{5}).*", "\\1", test)
>>>>
>>>
>>> > test
>>> [1] "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>>>
>>> I get different results than I expected given that "\\d" should be
>>> synonymous with "[0-9]":
>>>
>>> > sub(".*([0-9]{5}).*", "\\1", test)
>>> [1] "88958"
>>>
>>> > sub(".*(\\d{5}).*", "\\1", test)
>>> [1] "</th>"
>>>
>>> --
>>> David.
>>>
>>>> # 2 - strapply - see http://gsubfn.googlecode.com
>>>> library(gsubfn)
>>>> strapply(test, "\\d{5}", c)[[1]]
>>>>
>>>>
>>>> On Wed, May 5, 2010 at 5:13 PM, steven mosher <mosherste...@gmail.com> wrote:
>>>>
>>>>> Given a text like
>>>>>
>>>>> I want to be able to extract a matched regular expression from a piece of
>>>>> text.
>>>>>
>>>>> this apparently works, but is pretty ugly
>>>>> # some html
>>>>> test<-"</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>>>>> # a pattern to extract 5 digits
>>>>> pattern<-"[0-9]{5}"
>>>>> # regexpr returns a start point[1] and an attribute "match.length"
>>>>> # attr(,"match.length")
>>>>> # get the substring from the start point to the stop point.. where stop =
>>>>> # start +length-1
>>>>>
>>>>> answer<-substr(test,regexpr(pattern,test)[1],regexpr(pattern,test)[1]+attr(regexpr(pattern,test),"match.length")-1)
>>>>>
>>>>> > answer
>>>>> [1] "88958"
>>>>>
>>>>> I tried using sub(pattern, replacement, x) with a regexp that captured the
>>>>> group. I'd found an example of this in the mails
>>>>> but it didn't seem to work..
>>>>>
>>>> ______________________________________________
>>>> r-h...@r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>
>>> David Winsemius, MD
>>> West Hartford, CT
>>>
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> R-SIG-Mac mailing list
>> R-SIG-Mac@stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/r-sig-mac
>>
>
> David Winsemius, MD
> West Hartford, CT
>

[[alternative HTML version deleted]]

_______________________________________________
R-SIG-Mac mailing list
R-SIG-Mac@stat.math.ethz.ch
https://stat.ethz.ch/mailman/listinfo/r-sig-mac