On May 6, 2010, at 2:21 AM, steven mosher wrote:

see below,

Using a regex in sub() fails if the pattern is //d{5} and succeeds
if the pattern [0-9]{5} is used. See the test cases below.

The issue did not occur on a Windows machine; David and I both had it on a Mac.

Except we both were using \\d rather than //d.

I believe that Steve is using R 2.11.0 but I am still using R 2.10.1 (but with the release of an Hmisc upgrade I will convert soon.)

--
David.

> sessionInfo()
R version 2.10.1 RC (2009-12-09 r50695)
x86_64-apple-darwin9.8.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] tcltk stats graphics grDevices utils datasets methods base

other attached packages:
[1] gsubfn_0.5-2 proto_0.3-8 zoo_1.6-3 SASxport_1.2.3 lattice_0.18-3

loaded via a namespace (and not attached):
[1] chron_2.3-35 grid_2.10.1  tools_2.10.1

r11

mac os 10.5

---------- Forwarded message ----------
From: steven mosher <mosherste...@gmail.com>
Date: Wed, May 5, 2010 at 3:25 PM
Subject: Re: [R] extracting a matched string using regexpr
To: David Winsemius <dwinsem...@comcast.net>
Cc: Gabor Grothendieck <ggrothendi...@gmail.com>, r-help <r-h...@r-project.org>


with a fresh restart



test<-"</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"

test
[1] "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
sub(".*(\\d{5}).*", "\\1", test)
[1] "</th>"
sub(".*([0-9]{5}).*", "\\1", test)
[1] "88958"
test2<-"aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"
sub(".*(\\d{5}).*", "\\1", test2)
[1] "WWWWW"

sub(".*(\\d{5}).*", "\\1", test2)
[1] "WWWWW"
sub(".*([0-9]{5}).*", "\\1", test2)
[1] "12345"
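
A minimal sketch for re-running the comparison above on another machine. It assumes only base R; the helper name `compare_digit_patterns` is made up for illustration and does not come from the thread. It applies both patterns to each string and reports whether they agree, so a platform showing the discrepancy would give `agree = FALSE`.

```r
# Hypothetical helper (not from the thread): run both patterns and compare.
compare_digit_patterns <- function(x) {
  with_d     <- sub(".*(\\d{5}).*", "\\1", x)     # "\\d" form
  with_class <- sub(".*([0-9]{5}).*", "\\1", x)   # "[0-9]" form
  data.frame(with_d = with_d,
             with_class = with_class,
             agree = with_d == with_class,
             stringsAsFactors = FALSE)
}

test  <- "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
test2 <- "aaaaaaaaaaaaaaaaaaa12345WWWWWWWWWWWWW"

res <- compare_digit_patterns(c(test, test2))
print(res)
print(sessionInfo())  # version and locale are worth including in a report
```

Posting the `res` table together with the `sessionInfo()` output makes it easier to see which R versions and locales are affected.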


Steve.



On Wed, May 5, 2010 at 3:20 PM, David Winsemius <dwinsem...@comcast.net> wrote:


On May 5, 2010, at 5:35 PM, Gabor Grothendieck wrote:

Here are two ways to extract 5 digits.

In the first one \\1 refers to the portion matched between the
parentheses in the regular expression.

In the second one, strapply is like apply: the object to be worked on
is the first argument (array for apply, string for strapply), the
second modifies it (which dimension for apply, regular expression for
strapply), and the last is a function which acts on each value
(typically each row or column for apply and each match for strapply).
In this case we use c as our function to just return all the results.
They are returned in a list with one component per string, but here
test is just a single string, so we get a list of length one and we ask
for the contents of the first component using [[1]].

# 1 - sub
sub(".*(\\d{5}).*", "\\1", test)

test
[1] "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"

I get different results than I expected given that "\\d" should be
synonymous with "[0-9]":


sub(".*([0-9]{5}).*", "\\1", test)
[1] "88958"

sub(".*(\\d{5}).*", "\\1", test)
[1] "</th>"

--
David.


# 2 - strapply - see http://gsubfn.googlecode.com
library(gsubfn)
strapply(test, "\\d{5}", c)[[1]]
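
For comparison, a base-R sketch of the same extraction without gsubfn. It relies on `regmatches()`, which is available in R 2.14.0 and later (so not in the versions discussed in this thread), and it uses `[0-9]` rather than `\\d` to sidestep the discrepancy reported above.

```r
test <- "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"

# First match only: regexpr() locates it, regmatches() extracts it.
first <- regmatches(test, regexpr("[0-9]{5}", test))

# All matches: gregexpr() gives one list component per input string,
# much like strapply(), so [[1]] picks the matches for the single string.
all_matches <- regmatches(test, gregexpr("[0-9]{5}", test))[[1]]

print(first)
print(all_matches)
```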



On Wed, May 5, 2010 at 5:13 PM, steven mosher <mosherste...@gmail.com> wrote:

Given a text like

I want to be able to extract a matched regular expression from a piece of
text.

this apparently works, but is pretty ugly
# some html

test<-"</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
# a pattern to extract 5 digits

pattern<-"[0-9]{5}"

# regexpr returns a start point [1] and an attribute "match.length":
# attr(, "match.length")
# get the substring from the start point to the stop point, where
# stop = start + length - 1


answer<-substr(test,regexpr(pattern,test)[1],regexpr(pattern,test)[1]+attr(regexpr(pattern,test),"match.length")-1)

answer

[1] "88958"
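
The same start/stop arithmetic can be written with a single call to regexpr(). This is only a sketch: the helper name `extract_first` is hypothetical, it handles one string at a time, and it returns NA when there is no match, a case the one-liner above does not cover.

```r
# Hypothetical helper: first match of `pattern` in a single string `x`, or NA.
extract_first <- function(pattern, x) {
  m <- regexpr(pattern, x)                  # start position, -1 if no match
  if (m[1] == -1) return(NA_character_)
  stop_pos <- m[1] + attr(m, "match.length") - 1
  substr(x, m[1], stop_pos)
}

test <- "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
pattern <- "[0-9]{5}"

answer   <- extract_first(pattern, test)
no_match <- extract_first(pattern, "no digits here")
print(answer)
print(no_match)
```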

I tried using sub(pattern, replacement, x) with a regexp that captured the
group. I'd found an example of this in the mails, but it didn't seem to work.


______________________________________________
r-h...@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


David Winsemius, MD
West Hartford, CT




_______________________________________________
R-SIG-Mac mailing list
R-SIG-Mac@stat.math.ethz.ch
https://stat.ethz.ch/mailman/listinfo/r-sig-mac
