Quoting Marc Girondot via R-help:
Hi everybody,
I have some questions about the way that sub is working. I hope that
someone has the answer:
1/ Why does the second example not return an empty string? There is
no match.
subtext <- "-1980-"
sub(".*(1980).*", "\\1", subtext) # return 1980
sub(".*(1981).*", "\\1", subtext) # return -1980-
This is as documented in ?sub:
"Elements of character vectors x which are not
substituted will be returned unchanged"
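If you do want an empty string when nothing matches, one option is to test for a match first. A minimal sketch, in which extract_or_empty is a hypothetical helper name and not something provided by base R:

```r
# Hypothetical helper (not part of base R): return the captured group,
# or "" for elements where the pattern does not match at all.
extract_or_empty <- function(pattern, x) {
  ifelse(grepl(pattern, x), sub(pattern, "\\1", x), "")
}

subtext <- "-1980-"
found   <- extract_or_empty(".*(1980).*", subtext)
missing <- extract_or_empty(".*(1981).*", subtext)
found
## [1] "1980"
missing
## [1] ""
```

Replace "" with NA_character_ in the helper if a missing value suits you better than an empty string.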
2/ According to the sub documentation, it replaces the first occurrence of a
pattern: why does it not return 1980?
subtext <- " 1980 1981 "
sub(".*(198[01]).*", "\\1", subtext) # return 1981
Because the pattern matches the whole string,
not just the year:
regexpr(".*(198[01]).*", subtext)
## [1] 1
## attr(,"match.length")
## [1] 11
## attr(,"useBytes")
## [1] TRUE
Within this match, the leading ".*" is greedy: it consumes as much of the
string as it can, so the group captures the last year that still lets
the whole pattern match, which is "1981". If you want to _extract_ the
first year, use a non-greedy RE instead:
sub(".*?(198[01]).*", "\\1", subtext)
## [1] "1980"
I say _extract_ because you may _replace_ the pattern, as expected:
sub("198[01]", "YYYY", subtext)
## [1] " YYYY 1981 "
That is because the pattern does not match the whole string.
Perhaps this example makes it clearer:
test <- "1 2 3 4 5"
sub("([0-9])", "\\1\\1", test)
## [1] "11 2 3 4 5"
sub(".*([0-9]).*", "\\1\\1", test)
## [1] "55"
sub(".*?([0-9]).*", "\\1\\1", test)
## [1] "11"
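Another way to extract rather than replace, without wrapping the pattern in ".*", is to combine regexpr (first match) or gregexpr (all matches) with regmatches. This is only a sketch of an alternative; the variable names are made up for illustration:

```r
subtext <- " 1980 1981 "

# regexpr() locates the first match; regmatches() extracts the matched text.
first_year <- regmatches(subtext, regexpr("198[01]", subtext))

# gregexpr() locates every match; the result is a list with one
# character vector per input element.
all_years <- regmatches(subtext, gregexpr("198[01]", subtext))[[1]]

first_year
## [1] "1980"
all_years
## [1] "1980" "1981"
```

Note that regmatches with regexpr drops elements that have no match, so the result can be shorter than the input vector.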
3/ I want to extract the year from text; I use:
subtext <- "bla 1980 bla"
sub(".*[ \\.\\(-]([12][01289][0-9][0-9])[ \\.\\)-].*", "\\1",
subtext) # return 1980
subtext <- "bla 2010 bla"
sub(".*[ \\.\\(-]([12][01289][0-9][0-9])[ \\.\\)-].*", "\\1",
subtext) # return 2010
but
subtext <- "bla 1010 bla"
sub(".*[ \\.\\(-]([12][01289][0-9][0-9])[ \\.\\)-].*", "\\1",
subtext) # return 1010
I would like to exclude 1010 and other cases like it.
The solution would be:
18[0-9][0-9] or 19[0-9][0-9] or 200[0-9] or 201[0-9]
Is there a way to write such a pattern for grep?
You answered this yourself, I think.
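To spell it out: "or" in a regular expression is written with "|" inside a group. A sketch under a few assumptions: the names year_re, pattern and extract_year are made up here, the four alternatives are condensed into two, and the delimiter classes are written without backslashes (inside brackets, ".", "(" and ")" are literal, and "-" is literal when it comes last):

```r
# Years 1800-1999 or 2000-2019, i.e. 18xx | 19xx | 200x | 201x.
year_re <- "(1[89][0-9]{2}|20[01][0-9])"

# Same surrounding delimiters as in the question.
pattern <- paste0(".*[ .(-]", year_re, "[ .)-].*")

# Hypothetical helper: NA where there is no acceptable year, because
# sub() alone would return such elements unchanged.
extract_year <- function(x) {
  out <- rep(NA_character_, length(x))
  hit <- grepl(pattern, x)
  out[hit] <- sub(pattern, "\\1", x[hit])
  out
}

years <- extract_year(c("bla 1980 bla", "bla 2010 bla", "bla 1010 bla"))
years
## [1] "1980" "2010" NA
```

As in your second question, the leading ".*" is greedy, so with several years in one string this returns the last one.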
Thanks a lot
Marc
--
Enrico Schumann
Lucerne, Switzerland
http://enricoschumann.net