See below.

Using a regex in sub() fails if the pattern is \\d{5} and succeeds
if the pattern [0-9]{5} is used. See the test cases below.

The issue did not occur on a Windows machine; David and I both had it on a Mac.

r11

Mac OS 10.5

---------- Forwarded message ----------
From: steven mosher <mosherste...@gmail.com>
Date: Wed, May 5, 2010 at 3:25 PM
Subject: Re: [R] extracting a matched string using regexpr
To: David Winsemius <dwinsem...@comcast.net>
Cc: Gabor Grothendieck <ggrothendi...@gmail.com>, r-help <r-h...@r-project.org>


with a fresh restart



test<-"</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>
> test
[1] "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
> sub(".*(\\d{5}).*", "\\1", test)
[1] "</th>"
> sub(".*([0-9]{5}).*", "\\1", test)
[1] "88958"
> test2<-"aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"
> sub(".*(\\d{5}).*", "\\1", test2)
[1] "WWWWW"
>
> sub(".*(\\d{5}).*", "\\1", test2)
[1] "WWWWW"
> sub(".*([0-9]{5}).*", "\\1", test2)
[1] "12345"
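
For anyone who wants to check their own build, here is a minimal sketch that runs the same extraction with several spellings of "digit" and reports whether they agree. The helper name extract5 and the variable all_agree are made up for illustration and are not from the thread. The POSIX class and perl = TRUE variants are only things that might be worth trying; nothing above confirms they behave differently on the affected Mac.

```r
# Hypothetical helper, not from the thread: try four spellings of
# "five digits" on the same input and return the results side by side.
extract5 <- function(x) {
  c(
    backslash_d = sub(".*(\\d{5}).*", "\\1", x),
    bracket     = sub(".*([0-9]{5}).*", "\\1", x),
    posix_class = sub(".*([[:digit:]]{5}).*", "\\1", x),
    # (?s) lets "." match a newline under PCRE, which it does not do by default
    perl_engine = sub("(?s).*(\\d{5}).*", "\\1", x, perl = TRUE)
  )
}

test  <- "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
test2 <- "aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"

res  <- extract5(test)
res2 <- extract5(test2)
print(res)
print(res2)

# TRUE means every spelling extracted the same substring
all_agree <- length(unique(res)) == 1 && length(unique(res2)) == 1
print(all_agree)
```

If all_agree comes back FALSE, the named vectors printed above it show which spelling is the odd one out on that machine.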


Steve.



On Wed, May 5, 2010 at 3:20 PM, David Winsemius <dwinsem...@comcast.net> wrote:

>
> On May 5, 2010, at 5:35 PM, Gabor Grothendieck wrote:
>
>> Here are two ways to extract 5 digits.
>>
>> In the first one \\1 refers to the portion matched between the
>> parentheses in the regular expression.
>>
>> In the second one strapply is like apply where the object to be worked
>> on is the first argument (array for apply, string for strapply) the
>> second modifies it (which dimension for apply, regular expression for
>> strapply) and the last is a function which acts on each value
>> (typically each row or column for apply and each match for strapply).
>> In this case we use c as our function to just return all the results.
>> They are returned in a list with one component per string but here
>> test is just a single string so we get a list one long and we ask for
>> the contents of the first component using [[1]].
>>
>> # 1 - sub
>> sub(".*(\\d{5}).*", "\\1", test)
>>
> > test
> [1] "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>
> I get different results than I expected given that "\\d" should be
> synonymous with "[0-9]":
>
>
> > sub(".*([0-9]{5}).*", "\\1", test)
> [1] "88958"
>
> > sub(".*(\\d{5}).*", "\\1", test)
> [1] "</th>"
>
> --
> David.
>
>>
>> # 2 - strapply - see http://gsubfn.googlecode.com
>> library(gsubfn)
>> strapply(test, "\\d{5}", c)[[1]]
>>
>>
>>
>> On Wed, May 5, 2010 at 5:13 PM, steven mosher <mosherste...@gmail.com>
>> wrote:
>>
>>> Given a piece of text, I want to be able to extract the substring
>>> matched by a regular expression.
>>>
>>> This apparently works, but is pretty ugly:
>>> # some html
>>>
>>> test<-"</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
>>> # a pattern to extract 5 digits
>>>
>>>> pattern<-"[0-9]{5}"
>>>>
>>> # regexpr returns a start point [1] and an attribute "match.length":
>>> # attr(, "match.length")
>>> # get the substring from the start point to the stop point, where
>>> # stop = start + length - 1
>>>
>>>>
>>>> answer<-substr(test,regexpr(pattern,test)[1],regexpr(pattern,test)[1]+attr(regexpr(pattern,test),"match.length")-1)
>>>
>>>> answer
>>>>
>>> [1] "88958"
>>>
>>> I tried using sub(pattern, replacement, x) with a regexp that captured
>>> the group. I'd found an example of this in the mails,
>>> but it didn't seem to work.
>>>
>>
>> ______________________________________________
>> r-h...@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
> David Winsemius, MD
> West Hartford, CT
>
>
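
The substr()/regexpr() approach from the original question can be tidied by calling regexpr() once and reusing its result. This is only a sketch of that idea, not the method anyone in the thread settled on; the names m, start, len, answer and answer2 are illustrative. It also shows regmatches(), which ships with later versions of base R than the one discussed here, so treat that line as an assumption about your installation.

```r
test <- "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
pattern <- "[0-9]{5}"

# Run the match once; the result carries the start position and,
# as an attribute, the length of the match.
m     <- regexpr(pattern, test)
start <- as.integer(m)
len   <- attr(m, "match.length")

# regexpr() returns -1 when nothing matches, so guard against that.
answer <- if (start > 0) substr(test, start, start + len - 1) else NA_character_
print(answer)

# Assumes a base R recent enough to provide regmatches().
answer2 <- regmatches(test, m)
print(answer2)
```

Because the pattern here is [0-9]{5} rather than \\d{5}, this route sidesteps the discrepancy reported at the top of the thread.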


_______________________________________________
R-SIG-Mac mailing list
R-SIG-Mac@stat.math.ethz.ch
https://stat.ethz.ch/mailman/listinfo/r-sig-mac
