Re: [R] regex - negate a word
Prof Brian Ripley wrote:
> On Mon, 19 Jan 2009, Rolf Turner wrote:
>> On 19/01/2009, at 10:44 AM, Gabor Grothendieck wrote:
>>> Well, that's why it was only provided when you insisted. This is not what regexp's are good at.
>>>
>>> On Sun, Jan 18, 2009 at 4:35 PM, Rau, Roland r...@demogr.mpg.de wrote:
>>>> Thanks! (I have to admit, though, that I expected something simple)
>>
>> It may not be what regexp's are good at, but the grep command in unix/linux does what is required *very* simply via the ``-v'' flag. I conjecture that it would not be difficult to add an argument with similar impact to the grep() function in R.
>
> Indeed. I have often wondered why grep() returned indices, when a logical vector would seem more natural in R (and !grep(...) would have been all that was needed). Looking at the code I see it does in fact compute a logical vector, just not return it. So adding 'invert' (the long-form of -v is --invert) is a job of a very few lines and I have done so for 2.9.0.

in fact, it's simpler than that. instead of redundantly distributing the fix over four different lines in character.c, it's enough to ^= the logical vector of matched/unmatched flags in just one place, on-the-fly, close to the end of the loop over the vector of input strings. see attached patch.

for consistency, you might want to
- name the internal invert flag 'invert_opt' instead of 'invert';
- apply the same fix to agrep.

it's also trivial to add another argument to grep, say 'logical', which will cause grep to return a logical vector of the same length as the input strings vector. see the attached patch.

note: i am a novice to r internals, and i get some mystical warnings i haven't decoded yet while using the extended grep, but otherwise the code compiles well and grep works as intended; you'd need to fix the cause of the warnings.

if you want the 'logical' argument, you need to decide how it interacts with 'value'. in the patch, 'value' set to TRUE resets 'logical' to FALSE, with a warning.
further suggestions: the arguments 'value' and 'logical' could be replaced with one argument, say 'output', which would take a value from {'indices', 'values', 'logical'}. it might make further extensions easier to implement and maintain.

attached are patches to character.c, names.c, and grep.R; if you tell me which other files need a patch to get rid of the warnings (see below), i'll make one.

    s = c("abc", "bcd", "cde")

    grep("b", s)
    # 1 2
    grep("b", s, value=TRUE)
    # "abc" "bcd"
    grep("b", s, logical=TRUE)
    # TRUE TRUE FALSE
    s[grep("b", s, logical=TRUE)]
    # "abc" "bcd"
    # Warning: stack imbalance in 'grep', 9 then 10
    # Warning: stack imbalance in '.Internal', 8 then 9
    # Warning: stack imbalance in '{', 6 then 7

    grep("b", s, invert=TRUE)
    # 3
    grep("b", s, invert=TRUE, value=TRUE)
    # "cde"
    s[!grep("b", s, logical=TRUE)]
    # "cde"
    # Warning: stack imbalance in 'grep', 15 then 16
    # Warning: stack imbalance in '.Internal', 14 then 15
    # Warning: stack imbalance in '{', 12 then 13
    # Warning: stack imbalance in '!', 6 then 7
    # Warning: stack imbalance in '[', 2 then 3

vQ

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
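A minimal sketch of how the two behaviours discussed above (inverted matching and a logical result) can be obtained without the patched 'logical' argument, assuming a current R release in which grep() accepts invert = and grepl() is available. The variable names are illustrative only.

```r
s <- c("abc", "bcd", "cde")

# Indices and values of the elements that do NOT match the pattern
no_match_idx <- grep("b", s, invert = TRUE)
no_match_val <- grep("b", s, invert = TRUE, value = TRUE)

# Logical vector of matches, one entry per input string
is_match <- grepl("b", s)

# Negating the logical vector is safe even when nothing matches
kept <- s[!is_match]
```

Indexing with a negated logical vector avoids the empty-index problem of s[-grep(...)], because the logical vector always has the same length as the input.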
Re: [R] regex - negate a word
Wacek Kusnierczyk wrote:
> attached are patches to character.c, names.c, and grep.R; if you tell me

forgot to add: the patches are against the latest r-devel (19.01.2009). compiled and tested on 32b Ubuntu 8.04.

vQ
Re: [R] regex - negate a word
Rolf Turner wrote:
> On 19/01/2009, at 10:44 AM, Gabor Grothendieck wrote:
>> Well, that's why it was only provided when you insisted. This is not what regexp's are good at.
>>
>> On Sun, Jan 18, 2009 at 4:35 PM, Rau, Roland r...@demogr.mpg.de wrote:
>>> Thanks! (I have to admit, though, that I expected something simple)
>
> It may not be what regexp's are good at, but the grep command in unix/linux does what is required *very* simply via the ``-v'' flag. I conjecture that it would not be difficult to add an argument with similar impact to the grep() function in R.

something like grep(..., inverse=TRUE), perhaps.

vQ
Re: [R] regex - negate a word
Stavros Macrakis wrote:
> On Sun, Jan 18, 2009 at 2:22 PM, Wacek Kusnierczyk waclaw.marcin.kusnierc...@idi.ntnu.no wrote:
>>     x[-grep("abc", x)]
>> which unfortunately fails if none of the strings in x matches the pattern, i.e., grep returns integer(0);
>
> Yes.
>
>> arguably, x[integer(0)] should rather return all elements of x
>
> The meaning of x[V] (for an integer subscript vector V) is:

what about numeric vectors? r performs smart downcasting here:

    x[1.1]  # same as x[1]
    x[0.3]  # character(0)

> ignore 0 entries, and then:

what if V=NULL?

> a) if !(all(V>0) | all(V<0)) => ERROR

there is no error for x[V] with V=0, V=as.numeric(NA), or V=NaN.

> b) if all(V>0): length(x[V]) == length(V)

unfortunately, false if V contains a non-integer (so it goes beyond your discussion, but may cause problems in practice):

    x[c(1, 0.5)]  # one item (if x is non-empty)

> c) if all(V<0): length(x[V]) == length(x)-length(unique(V))

not true for cases like V=c(-1, -1.5), which again go beyond your discussion, but may happen in practice.

interestingly, unique(c(NA, NA)) is just NA, rather than c(NA, NA). i'd think that if we have two non-available values, we can't be sure they're in fact equal, but unique apparently is. (you'd have to tell it not to be with incomparables=NA.)

> When length(V)==0, the preconditions are true for both (b) and (c), so

interestingly, all(V>0) && all(V<0) is TRUE for V=c().

> the R design has made the decision that length(x[V]) == 0 in this case. If you're going to have the "negative indices means exclusion" trick, this seems like a reasonable convention.

i didn't say this was unreasonable, just that x[integer(0)] should, arguably, return x. 'empty index' is not a precise enough expression for it to be obvious to everyone that integer(0) is *not* an empty index, and less so with NULL. what is meant, i guess, is 'empty index expression', i.e., no index rather than empty index, and i'd humbly suggest (risking being charged with boring pedantry) to improve tfm.

vQ
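The subscript cases debated above can be checked directly. This is a small sketch on an illustrative three-element vector; it relies only on the documented rule that numeric subscripts are truncated toward zero before use.

```r
x <- c("a", "b", "c")

sel_empty   <- x[integer(0)]    # zero-length index: selects nothing
sel_negated <- x[-integer(0)]   # -integer(0) is still integer(0): nothing
sel_missing <- x[]              # a missing index selects everything

sel_frac    <- x[1.1]           # 1.1 truncates to 1
sel_mixed   <- x[c(1, 0.5)]     # 0.5 truncates to 0, which is ignored
sel_neg     <- x[c(-1, -1.5)]   # both truncate to -1: only "a" is dropped
```

The second line is the reason x[-grep(pattern, x)] returns an empty vector when nothing matches.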
Re: [R] regex - negate a word
On Mon, 19 Jan 2009, Rolf Turner wrote:
> On 19/01/2009, at 10:44 AM, Gabor Grothendieck wrote:
>> Well, that's why it was only provided when you insisted. This is not what regexp's are good at.
>>
>> On Sun, Jan 18, 2009 at 4:35 PM, Rau, Roland r...@demogr.mpg.de wrote:
>>> Thanks! (I have to admit, though, that I expected something simple)
>
> It may not be what regexp's are good at, but the grep command in unix/linux does what is required *very* simply via the ``-v'' flag. I conjecture that it would not be difficult to add an argument with similar impact to the grep() function in R.

Indeed. I have often wondered why grep() returned indices, when a logical vector would seem more natural in R (and !grep(...) would have been all that was needed). Looking at the code I see it does in fact compute a logical vector, just not return it. So adding 'invert' (the long-form of -v is --invert) is a job of a very few lines and I have done so for 2.9.0.

--
Brian D. Ripley, rip...@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
[R] regex - negate a word
Dear all,

let's assume I have a vector of character strings:

    x <- c("abcdef", "defabc", "qwerty")

What I would like to find is the following: all elements where the word 'abc' does not appear (i.e. 3 in this case of 'x').

Since I am not really experienced with regular expressions, I started slowly and thought I would first find all words where 'abc' actually does appear:

    grep(pattern="abc", x=x)
    [1] 1 2

So far, so good. Now I read that ^ is the negation operator. But it can also denote the beginning of a string, as in:

    grep(pattern="^abc", x=x)
    [1] 1

Of course, we need to put it inside square brackets to negate the expression [1]:

    grep(pattern="[^abc]", x=x)
    [1] 1 2 3

But this is not what I want either.

I'd appreciate any help. I assume this is rather easy and straightforward.

Thanks,
Roland

[1] http://www.zytrax.com/tech/web/regex.htm: "The ^ (circumflex or caret) inside square brackets negates the expression"

--
This mail has been sent through the MPI for Demographic Research. Should you receive a mail that is apparently from a MPI user without this text displayed, then the address has most likely been faked. If you are uncertain about the validity of this message, please check the mail header or ask your system administrator for assistance.
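A short sketch of why the bracketed pattern in the question matches every element: [^abc] is a negated character class, so it matches any single character other than a, b or c, not "a string without the word abc". The second input vector is made up for illustration.

```r
x <- c("abcdef", "defabc", "qwerty")

# Every string contains at least one character outside {a, b, c},
# so the negated class matches all three elements.
class_hits <- grep("[^abc]", x)

# Only strings built entirely from a, b and c fail to match.
class_hits2 <- grep("[^abc]", c("abc", "cab", "abcd"))
```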
Re: [R] regex - negate a word
Just remove those elements that match:

    x <- c("abcdef", "defabc", "qwerty")
    x[-grep('abc', x)]
    [1] "qwerty"

On Sun, Jan 18, 2009 at 1:35 PM, Rau, Roland r...@demogr.mpg.de wrote:
> Dear all, let's assume I have a vector of character strings:
>     x <- c("abcdef", "defabc", "qwerty")
> What I would like to find is the following: all elements where the word 'abc' does not appear (i.e. 3 in this case of 'x').
> [...]

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
Re: [R] regex - negate a word
Rau, Roland wrote:
> Dear all, let's assume I have a vector of character strings:
>     x <- c("abcdef", "defabc", "qwerty")
> What I would like to find is the following: all elements where the word 'abc' does not appear (i.e. 3 in this case of 'x').

a quick shot is:

    x[-grep("abc", x)]

which unfortunately fails if none of the strings in x matches the pattern, i.e., grep returns integer(0); arguably, x[integer(0)] should rather return all elements of x:

    "An empty index selects all values" (from ?'[')

but apparently integer(0) does not count as an empty index (and neither does NULL). so you may want something like:

    strings = c("abcdef", "defabc", "qwerty")
    pattern = "abc"
    if (length(matching <- grep(pattern, strings))) x[-matching] else x

vQ
Re: [R] regex - negate a word
Try this:

    # indexes
    setdiff(seq_along(x), grep("abc", x))

    # values
    setdiff(x, grep("abc", x, value = TRUE))

Another possibility is:

    z <- "abc"
    x0 <- c(x, z) # to handle no match case
    x0[- grep(z, x0)] # values

On Sun, Jan 18, 2009 at 1:35 PM, Rau, Roland r...@demogr.mpg.de wrote:
> Dear all, let's assume I have a vector of character strings:
>     x <- c("abcdef", "defabc", "qwerty")
> What I would like to find is the following: all elements where the word 'abc' does not appear (i.e. 3 in this case of 'x').
> [...]
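One caveat about the value-based setdiff() variant above, sketched with an illustrative vector that contains a repeated string: setdiff() returns unique values, so duplicates among the non-matching elements are collapsed. The index-based variant keeps them.

```r
x <- c("abcdef", "qwerty", "qwerty")

# Value-based: duplicates are collapsed because setdiff() returns unique values
by_value <- setdiff(x, grep("abc", x, value = TRUE))

# Index-based: positions are unique, so every non-matching element is kept
by_index <- x[setdiff(seq_along(x), grep("abc", x))]
```

Both variants handle the no-match case, since setdiff() with an empty second argument returns its (unique) first argument.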
Re: [R] regex - negate a word
Roland,

I think you were almost there with your first example. How about using:

    x <- c("abcdef", "defabc", "qwerty")
    y <- grep(pattern="abc", x=x)
    z.char <- x[-y]
    z.index <- (1:length(x))[-y]
    z.char
    [1] "qwerty"
    z.index
    [1] 3

Cheers,
eric

Rau, Roland wrote:
> Dear all, let's assume I have a vector of character strings:
>     x <- c("abcdef", "defabc", "qwerty")
> What I would like to find is the following: all elements where the word 'abc' does not appear (i.e. 3 in this case of 'x').
> [...]

--
Eric Archer, Ph.D.
Southwest Fisheries Science Center
8604 La Jolla Shores Dr.
La Jolla, CA 92037
858-546-7121 (work)
858-546-7003 (FAX)

ETP Cetacean Assessment Program: http://swfsc.noaa.gov/prd-etp.aspx
Population ID Program: http://swfsc.noaa.gov/prd-popid.aspx

"Innocence about Science is the worst crime today." - Sir Charles Percy Snow
"Lighthouses are more helpful than churches." - Benjamin Franklin
"...but I'll take a GPS over either one." - John C. "Craig" George
Re: [R] regex - negate a word
Jorge Ivan Velez wrote:
> Hi Wacek,
> I think you wanted to say "strings" instead of "x" in your last line : )

of course, thanks. the correct version is:

    if (length(matching <- grep(pattern, strings))) strings[-matching] else strings

btw., and in relation to a recent post complaining about how the mailing list is maintained, i must say that although the idea that posts could be edited after they've been sent may not sound good in general, i think it would be useful to be able to just fix such minor typos in place instead of posting a correction. after all, the list is intended to serve as help to those who care not only to ask, but also to browse the archives. but this is a side comment, i take no sides and make no recommendations.

vQ

> Best,
> Jorge
>
> On Sun, Jan 18, 2009 at 2:22 PM, Wacek Kusnierczyk waclaw.marcin.kusnierc...@idi.ntnu.no wrote:
>> a quick shot is:
>>     x[-grep("abc", x)]
>> which unfortunately fails if none of the strings in x matches the pattern, i.e., grep returns integer(0);
>> [...]
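The corrected if-based version can be wrapped in a small function. This is only a sketch; drop_matching() is a hypothetical helper name, not part of base R.

```r
# drop_matching() is a hypothetical helper, not a base R function.
# It returns the elements of 'strings' that do not match 'pattern',
# and returns 'strings' unchanged when nothing matches.
drop_matching <- function(pattern, strings, ...) {
  matching <- grep(pattern, strings, ...)
  if (length(matching)) strings[-matching] else strings
}

strings <- c("abcdef", "defabc", "qwerty")

some_match <- drop_matching("abc", strings)
no_match   <- drop_matching("zzz", strings)

# For comparison: the unguarded form loses everything when nothing matches
unguarded  <- strings[-grep("zzz", strings)]
```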
Re: [R] regex - negate a word
Thank you very much to all of you for your fast and excellent help. Since the -grep(...) solution seems to be favored by most of the answers, I just wonder if there is really no regular expression which does the job?!?

Thanks again,
Roland

-----Original Message-----
From: Gabor Grothendieck [mailto:ggrothendi...@gmail.com]
Sent: Sun 1/18/2009 8:28 PM
To: Rau, Roland
Cc: r-help@r-project.org
Subject: Re: [R] regex - negate a word

Try this:

    # indexes
    setdiff(seq_along(x), grep("abc", x))

    # values
    setdiff(x, grep("abc", x, value = TRUE))

Another possibility is:

    z <- "abc"
    x0 <- c(x, z) # to handle no match case
    x0[- grep(z, x0)] # values

[...]
Re: [R] regex - negate a word
Try this:

    grep("^([^a]|a[^b]|ab[^c])*.{0,2}$", x, perl = TRUE)

On Sun, Jan 18, 2009 at 2:37 PM, Rau, Roland r...@demogr.mpg.de wrote:
> Thank you very much to all of you for your fast and excellent help. Since the -grep(...) solution seems to be favored by most of the answers, I just wonder if there is really no regular expression which does the job?!?
>
> Thanks again,
> Roland
> [...]
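An alternative not proposed in the thread, sketched here under the assumption that perl = TRUE gives PCRE semantics: a negative lookahead anchored at the start of the string. The fourth test string is added for illustration; it shows a case where the hand-built pattern above still matches a string containing 'abc', because the a[^b] branch can consume the "aa".

```r
x <- c("abcdef", "defabc", "qwerty", "aabc")

# Hand-built negative pattern from the message above
hand_built <- grep("^([^a]|a[^b]|ab[^c])*.{0,2}$", x, perl = TRUE)

# Negative lookahead: succeed only if "abc" occurs nowhere after the start
lookahead <- grep("^(?!.*abc)", x, perl = TRUE)
```

On this input the lookahead version selects only "qwerty", while the hand-built pattern also selects "aabc".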
Re: [R] regex - negate a word
Gabor Grothendieck wrote:
> Try this:
>     # values
>     setdiff(x, grep("abc", x, value = TRUE))
> Another possibility is:
>     z <- "abc"
>     x0 <- c(x, z) # to handle no match case
>     x0[- grep(z, x0)] # values

on quick testing, these two and the if-based version have comparable runtime, with a minor win for the last one, and if the input is moderate this makes no real difference. however, the second solution above is likely to fail if the pattern is more complex, e.g., contains a character class or a wildcard:

    strings = c("xyz")
    pattern = "a[a-z]"
    strings[-grep(pattern, c(strings, pattern))]
    # character(0)

vQ
Re: [R] regex - negate a word
In that case just add fixed = TRUE

On Sun, Jan 18, 2009 at 2:58 PM, Wacek Kusnierczyk waclaw.marcin.kusnierc...@idi.ntnu.no wrote:
> however, the second solution above is likely to fail if the pattern is more complex, e.g., contains a character class or a wildcard:
>     strings = c("xyz")
>     pattern = "a[a-z]"
>     strings[-grep(pattern, c(strings, pattern))]
>     # character(0)
Re: [R] regex - negate a word
Gabor Grothendieck wrote:
> In that case just add fixed = TRUE

in general, if you want a complex pattern, you don't use 'fixed', and then again you risk incorrect (well, correct for r, but not for the problem) result in case no input string matches the pattern.

vQ
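A sketch of the trade-off described above, using an illustrative input in which one string does match the regular expression. With regex matching the appended pattern may fail to match itself; with fixed = TRUE it always matches itself, but the pattern is then treated literally and real regex matches are no longer removed.

```r
strings <- c("xyz", "ab")
pattern <- "a[a-z]"            # as a regex, this matches "ab"
x0 <- c(strings, pattern)      # append the pattern as a sentinel

# Regex matching: "ab" is dropped, but the sentinel does not match itself
# and therefore leaks into the result.
regex_result <- x0[-grep(pattern, x0)]

# Literal matching: only the sentinel matches, so "ab" wrongly survives.
fixed_result <- x0[-grep(pattern, x0, fixed = TRUE)]
```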
Re: [R] regex - negate a word
Gabor Grothendieck wrote:
> Try this:
>     grep("^([^a]|a[^b]|ab[^c])*.{0,2}$", x, perl = TRUE)

... and see how cumbersome it becomes for a pattern as trivial as 'abc'.

in perl, you typically don't invent such negative patterns, but rather don't match positive patterns: instead of the match operator =~ and a negative pattern, you use the no-match operator !~ and a positive pattern:

    @strings = ("abc", "xyz");
    @filtered = grep $_ !~ /abc/, @strings;

in r, one way to do the no-match is using -grep, but taking care of the special case of no matches at all in the input vector.

> On Sun, Jan 18, 2009 at 2:37 PM, Rau, Roland r...@demogr.mpg.de wrote:
>> Thank you very much to all of you for your fast and excellent help. Since the -grep(...) solution seems to be favored by most of the answers, I just wonder if there is really no regular expression which does the job?!?

in perl 5.10, you can try this:

    @strings = ("abc", "xyz");
    @filtered = grep $_ =~ /(abc)(*COMMIT)(*FAIL)|(*ACCEPT)/, @strings;

which works by making a string that matches the pattern fail, and any other string succeed despite no match.

vQ
Re: [R] regex - negate a word
Wacek Kusnierczyk wrote:
> On Sun, Jan 18, 2009 at 2:37 PM, Rau, Roland r...@demogr.mpg.de wrote:
>> Thank you very much to all of you for your fast and excellent help. Since the -grep(...) solution seems to be favored by most of the answers, I just wonder if there is really no regular expression which does the job?!?
>
> in perl 5.10, you can try this:
>     @strings = ("abc", "xyz");
>     @filtered = grep $_ =~ /(abc)(*COMMIT)(*FAIL)|(*ACCEPT)/, @strings;
> which works by making a string that matches the pattern fail, and any other string succeed despite no match.

incidentally, recent pcre accepts such regexes:

    # r code
    ungrep = function(pattern, x, ...)
        grep(paste(pattern, "(*COMMIT)(*FAIL)|(*ACCEPT)", sep=""), x, perl=TRUE, ...)

    strings = c("abc", "xyz")
    pattern = "a[a-z]"
    (filtered = strings[ungrep(pattern, strings)])
    # "xyz"

vQ
Re: [R] regex - negate a word
Wacek Kusnierczyk wrote:
>     # r code
>     ungrep = function(pattern, x, ...)
>         grep(paste(pattern, "(*COMMIT)(*FAIL)|(*ACCEPT)", sep=""), x, perl=TRUE, ...)
>
>     strings = c("abc", "xyz")
>     pattern = "a[a-z]"
>     (filtered = strings[ungrep(pattern, strings)])
>     # "xyz"

this was a toy example, but if you need this sort of ungrep with patterns involving alternations, you need a fix:

    ungrep("a|x", strings, value=TRUE)
    # "abc"
    # NOT character(0)

    # fix: wrap the pattern in a non-capturing group
    ungrep = function(pattern, x, ...)
        grep(paste("(?:", pattern, ")(*COMMIT)(*FAIL)|(*ACCEPT)", sep=""), x, perl=TRUE, ...)

    ungrep("a|x", strings, value=TRUE)
    # character(0)

vQ
Re: [R] regex - negate a word
Thanks! (I have to admit, though, that I expected something simple)

Thanks,
Roland

-----Original Message-----
From: Gabor Grothendieck [mailto:ggrothendi...@gmail.com]
Sent: Sun 1/18/2009 8:54 PM
To: Rau, Roland
Cc: r-help@r-project.org
Subject: Re: [R] regex - negate a word

Try this:

    grep("^([^a]|a[^b]|ab[^c])*.{0,2}$", x, perl = TRUE)

[...]
Re: [R] regex - negate a word
Well, that's why it was only provided when you insisted. This is not what regexp's are good at.

On Sun, Jan 18, 2009 at 4:35 PM, Rau, Roland r...@demogr.mpg.de wrote:

Thanks! (I have to admit, though, that I expected something simple)
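For comparison, a negative lookahead is another way to write "does not contain abc" as a single Perl-compatible pattern. This is only a sketch and not something proposed in the thread; the test element "aabc" is an added assumption, chosen because the alternation pattern quoted earlier appears to accept it even though it contains "abc".

```r
# Vector from the question, plus one hypothetical element ("aabc").
x <- c("abcdef", "defabc", "qwerty", "aabc")

# Alternation pattern as posted in the thread.
alternation <- grep("^([^a]|a[^b]|ab[^c])*.{0,2}$", x, perl = TRUE)

# Negative lookahead: succeed at the start of the string only if
# "abc" cannot be found anywhere after it.
lookahead <- grep("^(?!.*abc)", x, perl = TRUE)

print(alternation)  # includes the index of "aabc"
print(lookahead)    # only the index of "qwerty"
```

On "aabc" the alternation consumes "aa" through the a[^b] branch and then "b" and "c" one at a time, so the embedded "abc" goes unnoticed. The lookahead form avoids that case, at the cost of requiring perl = TRUE.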
Re: [R] regex - negate a word
On 19/01/2009, at 10:44 AM, Gabor Grothendieck wrote:

Well, that's why it was only provided when you insisted. This is not what regexp's are good at.

On Sun, Jan 18, 2009 at 4:35 PM, Rau, Roland r...@demogr.mpg.de wrote:

Thanks! (I have to admit, though, that I expected something simple)

It may not be what regexp's are good at, but the grep command in unix/linux does what is required *very* simply via the ``-v'' flag. I conjecture that it would not be difficult to add an argument with similar impact to the grep() function in R.

cheers,

Rolf Turner
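The argument conjectured here was later added to grep() as 'invert' (in R 2.9.0, according to the reply at the top of this thread). A minimal sketch, assuming an R version that has it:

```r
x <- c("abcdef", "defabc", "qwerty")

# Indices of the elements that do NOT match, like `grep -v`
idx <- grep("abc", x, invert = TRUE)

# The non-matching elements themselves
vals <- grep("abc", x, invert = TRUE, value = TRUE)

# When nothing matches the pattern, every element is returned
none_match <- grep("xyz", x, invert = TRUE, value = TRUE)

print(idx)
print(vals)
print(none_match)
```

Unlike x[-grep("abc", x)], this form needs no special handling when the pattern matches nothing.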
Re: [R] regex - negate a word
That's an entirely different point from whether regular expressions can do it: grep -v is just another way to do it, without using a regular expression to specify the entire job.

On Sun, Jan 18, 2009 at 5:02 PM, Rolf Turner r.tur...@auckland.ac.nz wrote:

It may not be what regexp's are good at, but the grep command in unix/linux does what is required *very* simply via the ``-v'' flag. I conjecture that it would not be difficult to add an argument with similar impact to the grep() function in R.
Re: [R] regex - negate a word
On Sun, Jan 18, 2009 at 2:22 PM, Wacek Kusnierczyk waclaw.marcin.kusnierc...@idi.ntnu.no wrote:

    x <- c("abcdef", "defabc", "qwerty")
    ...[find] all elements where the word 'abc' does not appear (i.e. 3 in this case of 'x').

    x[-grep("abc", x)]

which unfortunately fails if none of the strings in x matches the pattern, i.e., grep returns integer(0);

Yes.

arguably, x[integer(0)] should rather return all elements of x

The meaning of x[V] (for an integer subscript vector V) is: ignore 0 entries, and then:

a) if !(all(V>0) | all(V<0)) => ERROR
b) if all(V>0): length(x[V]) == length(V)
c) if all(V<0): length(x[V]) == length(x) - length(unique(V))

When length(V)==0, the preconditions are true for both (b) and (c), so the R design has made the decision that length(x[V]) == 0 in this case. If you're going to have the "negative indices mean exclusion" trick, this seems like a reasonable convention.

Of course, that means that you can't in general use x[-V] (where all(V>0)) to mean "all elements that are not in V". However, there is a workaround if you have an upper bound on length(x):

    x[ c(-2^30, -V) ]

This guarantees at least one negative number.

-s
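The empty-index case described above can be checked directly. The sketch below uses a pattern ("xyz", chosen here for illustration) that matches nothing in x, and adds a logical-mask variant that is not from the message but avoids the special case altogether.

```r
x <- c("abcdef", "defabc", "qwerty")

# No element matches, so grep() returns integer(0)
V <- grep("xyz", x)

# -integer(0) is still integer(0): case (b) wins and nothing is selected
naive <- x[-V]

# Workaround from the message: the out-of-range negative index is
# ignored, but it makes the subscript non-empty and all-negative
sentinel <- x[c(-2^30, -V)]

# Alternative (an addition, not from the thread): a logical mask
# has no empty-subscript ambiguity
mask <- x[!(seq_along(x) %in% V)]

print(naive)
print(sentinel)
print(mask)
```

The sentinel form relies on length(x) staying below 2^30; the logical mask has no such bound.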
Re: [R] regex - negate a word
Note that the variation of this that I posted already handles that case.

On Sun, Jan 18, 2009 at 5:32 PM, Stavros Macrakis macra...@alum.mit.edu wrote:

Of course, that means that you can't in general use x[-V] (where all(V>0)) to mean "all elements that are not in V". However, there is a workaround if you have an upper bound on length(x):

    x[ c(-2^30, -V) ]

This guarantees at least one negative number.
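The variation referred to here appends the pattern itself to the vector, so grep() always finds at least one match and the negative subscript is never empty. A sketch wrapped in a helper follows; the function name drop_matching and the use of fixed = TRUE are assumptions added here (the posted code used a plain grep(z, x0)), with fixed = TRUE chosen so that the appended string is guaranteed to match itself even if it contains regex metacharacters.

```r
x <- c("abcdef", "defabc", "qwerty")

# Hypothetical helper around the appended-sentinel idea from the thread.
drop_matching <- function(x, z) {
  x0 <- c(x, z)                     # append z so there is always a match
  x0[-grep(z, x0, fixed = TRUE)]    # drop all matches, including z itself
}

some_match <- drop_matching(x, "abc")  # pattern occurs in x
no_match   <- drop_matching(x, "xyz")  # pattern occurs nowhere in x

print(some_match)
print(no_match)
```

With a true regular expression as z (for example "^abc"), the appended element may not match the pattern, so the trick as posted is safest for literal search strings.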