On May 6, 2010, at 11:50 AM, David Winsemius wrote:

> Two Q's:
> A) Is this supposed to happen with perl-mode?:
>
> > test<-"</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
> >
> > sub(".*(\\d{5}).*", "\\1", test, perl=TRUE)
> [1] "88958\nW</th><th>26m</th>"
> >
> > sub(".*([0-9]{5}).*", "\\1", test, perl=TRUE)
> [1] "88958\nW</th><th>26m</th>"
>
Nope - perl does take EOL into account so .* will be matched only to the end of line. For your purposes you want to enable the (?s) option, so you probably meant:

> sub("(?s).*(\\d{5}).*", "\\1", test, perl=TRUE)
[1] "88958"

> Looks to me that a period is being improperly recognized.
>
> On May 6, 2010, at 11:28 AM, Simon Urbanek wrote:
>
>> FWIW I don't think \d is a basic regexp
>
> B) With regard to the default (which I read to be extended rather than
> basic) vs. perl-like, the Extended section of the regex documentation
> contains:
>
> " Symbols \d, \s, \D and \S denote the digit and space classes and their
> negations."
>

Yes, you're right - extended is the default.

Cheers,
Simon

>> so as I would expect the perl mode to work and it does:
>>
>>> test2<-"aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"
>>> sub(".*(\\d{5}).*", "\\1", test2,perl=TRUE)
>> [1] "12345"
>>
>> Yet I agree that it should either fail (i.e. return the unmodified string)
>> or return 12345.
>>
>> Also note that the bug is locale-specific:
>>
>> LANG=C R
>>
>>> test2<-"aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"
>>> sub(".*(\\d{5}).*", "\\1", test2,perl=TRUE)
>> [1] "12345"
>>> sub(".*(\\d{5}).*", "\\1", test2)
>> [1] "12345"
>>
>> Also note that this is not Mac-specific:
>>
>>> test2<-"aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"
>>> sub(".*(\\d{5}).*", "\\1", test2)
>> [1] "WWWWW"
>>> system("uname -sr")
>> Linux 2.6.32-trunk-amd64
>>> Sys.getlocale("LC_CTYPE")
>> [1] "en_US.UTF-8"
>>
>>
>> Cheers,
>> Simon
>>
>>
>>
>> On May 6, 2010, at 6:54 AM, David Winsemius wrote:
>>
>>>
>>> On May 6, 2010, at 2:21 AM, steven mosher wrote:
>>>
>>>> see below,
>>>>
>>>> using a regex in sub() fails if the pattern is //d{5} and succeeds
>>>> if the pattern [0-9]{5} is used.. see the test cases below.
>>>>
>>>> issue was not on windows machine and david and I had it on MAC.
>>>
>>> Except we both were using \\d rather than //d.
>>>
>>> I believe that Steve is using R 2.11.0 but I am still using R 2.10.1 (but
>>> with the release of an Hmisc upgrade I will convert soon.)
>>>
>>> --
>>> David.
>>>
>>>> sessionInfo()
>>> R version 2.10.1 RC (2009-12-09 r50695)
>>> x86_64-apple-darwin9.8.0
>>>
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] tcltk stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] gsubfn_0.5-2 proto_0.3-8 zoo_1.6-3 SASxport_1.2.3 lattice_0.18-3
>>>
>>> loaded via a namespace (and not attached):
>>> [1] chron_2.3-35 grid_2.10.1 tools_2.10.1
>>>>
>>>> r11
>>>>
>>>> mac os 10.5
>>>>
>>>> ---------- Forwarded message ----------
>>>> From: steven mosher <mosherste...@gmail.com>
>>>> Date: Wed, May 5, 2010 at 3:25 PM
>>>> Subject: Re: [R] extracting a matched string using regexpr
>>>> To: David Winsemius <dwinsem...@comcast.net>
>>>> Cc: Gabor Grothendieck <ggrothendi...@gmail.com>, r-help <r-h...@r-project.org>
>>>>
>>>>
>>>> with a fresh restart
>>>>
>>>>
>>>>
>>>> test<-"</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>>>>>
>>>>> test
>>>> [1]
>>>> "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>>>>> sub(".*(\\d{5}).*", "\\1", test)
>>>> [1] "</th>"
>>>>> sub(".*([0-9]{5}).*", "\\1", test)
>>>> [1] "88958"
>>>>> test2<-"aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"
>>>>> sub(".*(\\d{5}).*", "\\1", test2)
>>>> [1] "WWWWW"
>>>>>
>>>>> sub(".*(\\d{5}).*", "\\1", test2)
>>>> [1] "WWWWW"
>>>>> sub(".*([0-9]{5}).*", "\\1", test2)
>>>> [1] "12345"
>>>>
>>>>
>>>> Steve.
>>>>
>>>>
>>>>
>>>> On Wed, May 5, 2010 at 3:20 PM, David Winsemius
>>>> <dwinsem...@comcast.net> wrote:
>>>>
>>>>>
>>>>> On May 5, 2010, at 5:35 PM, Gabor Grothendieck wrote:
>>>>>
>>>>> Here are two ways to extract 5 digits.
>>>>>>
>>>>>> In the first one \\1 refers to the portion matched between the
>>>>>> parentheses in the regular expression.
>>>>>>
>>>>>> In the second one strapply is like apply where the object to be worked
>>>>>> on is the first argument (array for apply, string for strapply) the
>>>>>> second modifies it (which dimension for apply, regular expression for
>>>>>> strapply) and the last is a function which acts on each value
>>>>>> (typically each row or column for apply and each match for strapply).
>>>>>> In this case we use c as our function to just return all the results.
>>>>>> They are returned in a list with one component per string but here
>>>>>> test is just a single string so we get a list one long and we ask for
>>>>>> the contents of the first component using [[1]].
>>>>>>
>>>>>> # 1 - sub
>>>>>> sub(".*(\\d{5}).*", "\\1", test)
>>>>>>
>>>>>> test
>>>>> [1]
>>>>> "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>>>>>
>>>>> I get different results than I expected given that "\\d" should be
>>>>> synonymous with "[0-9]":
>>>>>
>>>>>
>>>>>> sub(".*([0-9]{5}).*", "\\1", test)
>>>>> [1] "88958"
>>>>>
>>>>>> sub(".*(\\d{5}).*", "\\1", test)
>>>>> [1] "</th>"
>>>>>
>>>>> --
>>>>> David.
>>>>>
>>>>>>
>>>>>> # 2 - strapply - see http://gsubfn.googlecode.com
>>>>>> library(gsubfn)
>>>>>> strapply(test, "\\d{5}", c)[[1]]
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, May 5, 2010 at 5:13 PM, steven mosher <mosherste...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Given a text like
>>>>>>>
>>>>>>> I want to be able to extract a matched regular expression from a piece of
>>>>>>> text.
>>>>>>>
>>>>>>> this apparently works, but is pretty ugly
>>>>>>> # some html
>>>>>>>
>>>>>>> test<-"</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>>>>>>> # a pattern to extract 5 digits
>>>>>>>
>>>>>>>> pattern<-"[0-9]{5}"
>>>>>>>>
>>>>>>> # regexpr returns a start point[1] and an attribute "match.length" attr(,"match.length")
>>>>>>> # get the substring from the start point to the stop point.. where stop = start +length-1
>>>>>>>
>>>>>>>>
>>>>>>>> answer<-substr(test,regexpr(pattern,test)[1],regexpr(pattern,test)[1]+attr(regexpr(pattern,test),"match.length")-1)
>>>>>>>
>>>>>>>> answer
>>>>>>>>
>>>>>>> [1] "88958"
>>>>>>>
>>>>>>> I tried using sub(pattern, replacement, x ) with a regexp that captured the
>>>>>>> group. I'd found an example of this in the mails
>>>>>>> but it didn't seem to work..
>>>>>>>
>>>>>>
>>>>>> ______________________________________________
>>>>>> r-h...@r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide
>>>>>> http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>
>>>>>
>>>>> David Winsemius, MD
>>>>> West Hartford, CT
>>>>>
>>>>>
>>>>
>>>> [[alternative HTML version deleted]]
>>>>
>>>> _______________________________________________
>>>> R-SIG-Mac mailing list
>>>> R-SIG-Mac@stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-mac
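
The three ways of pulling out the five digits that come up in this thread can be put side by side in a few lines of R. This is only a minimal sketch, not a definitive fix: the names five_perl, five_class and five_match are illustrative and do not appear in the thread, and regmatches() assumes R 2.14.0 or later, which is newer than the versions reported above. The explicit [0-9] class is used outside perl mode because that is the form every poster reported as working.

```r
# Test string from the thread; "\n" is a real newline inside the R string.
test <- "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"

# 1. perl = TRUE: "." does not match a newline unless (?s) is set,
#    so (?s) lets the trailing ".*" consume the second line as well.
five_perl <- sub("(?s).*(\\d{5}).*", "\\1", test, perl = TRUE)

# 2. Default engine with an explicit digit class instead of \\d.
five_class <- sub(".*([0-9]{5}).*", "\\1", test)

# 3. Extract the match directly instead of replacing around it
#    (assumption: regmatches() is available, i.e. R >= 2.14.0).
m <- regexpr("[0-9]{5}", test)
five_match <- regmatches(test, m)

print(c(five_perl, five_class, five_match))
```

The third form replaces the nested substr()/regexpr()/attr() call from the original post and involves no ".*" at all, so the newline in the string does not affect it.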