Re: [Rd] Change in grep behavior from 1.9.0 to R-patched

Roger D. Peng Fri, 11 Jun 2004 08:37:48 -0700

To make matters a little more interesting, I get some weird behavior on R 1.9.0 also. For example, when I run

x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))

and then run

d <- replicate(1000, length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE)))

> summary(d)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  13.00   13.00   13.00   30.47   13.00   84.00

Similar behavior on both R 1.9.0 and today's R-patched (I'm running on Linux). To me this smells like a memory issue in PCRE.
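A self-contained way to check for this kind of run-to-run variation, without depending on the names.R URL, might look like the sketch below. The vector x here is a hypothetical stand-in built to resemble the names in that file (it is not the real data), and the three extra non-matching names are assumptions added for illustration. With a correctly working PCRE, every repetition should give the same count.

```r
# Hypothetical stand-in for the vector read from names.R (not the real data):
# 14 lag prefixes x 6 pollutants, each followed by "tmean".
pollutants <- c("pm10", "pm25", "co", "no2", "so2", "o3")
lags <- c(paste0("l", 1:7), paste0("lm", 1:7))
x <- c(paste0(rep(lags, each = length(pollutants)), pollutants, "tmean"),
       "date", "tmean", "l1pm10")   # extra names that should not match

pattern <- "^l\\w+tmean"

# Repeat the same call many times; the count should never change.
counts <- replicate(1000,
                    length(grep(pattern, x, perl = TRUE, value = TRUE)))

# One distinct value (84 for this synthetic vector) means deterministic matching.
print(table(counts))
```

If table(counts) shows more than one distinct value, as in the summary above, the matching is not deterministic, which points at the regex engine rather than the pattern.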

-roger

Martin Maechler wrote:

"Roger" == Roger D Peng <[EMAIL PROTECTED]>
   on Fri, 11 Jun 2004 10:43:57 -0400 writes:
    Roger> I've noticed a change in the way grep() behaves between the 1.9.0
    Roger> release and a recent R-patched.  On 1.9.0 I get the following output:
    >> x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
    >> length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE))
    Roger> [1] 84
    Roger> And on R-patched (2004-06-11) I get
    >> x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
    >> length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE))
    Roger> [1] 13
I can reproduce this exactly.
    <....>
    Roger> I didn't find anything in the NEWS file that would indicate a change
yes: The src/extras/pcre/ (Perl Compatible Regular Expressions)
     library was upgraded, and since we assumed that wouldn't
     have any effect --- as we now see, too optimistically ---
     it wasn't documented in NEWS.
    Roger> and another problem is that I'm not sure which behavior is correct.
    Roger> My knowledge of regular expressions is limited.

The first one is correct I think: '\w' means word constituents (see below) and for 1.9.0, you get

> grep("^l\\w+tmean", x, perl = TRUE, value = TRUE)
 [1] "l1pm10tmean"  "l1pm25tmean"  "l1cotmean"    "l1no2tmean"   "l1so2tmean"
 [6] "l1o3tmean"    "l2pm10tmean"  "l2pm25tmean"  "l2cotmean"    "l2no2tmean"
[11] "l2so2tmean"   "l2o3tmean"    "l3pm10tmean"  "l3pm25tmean"  "l3cotmean"
[16] "l3no2tmean"   "l3so2tmean"   "l3o3tmean"    "l4pm10tmean"  "l4pm25tmean"
[21] "l4cotmean"    "l4no2tmean"   "l4so2tmean"   "l4o3tmean"    "l5pm10tmean"
[26] "l5pm25tmean"  "l5cotmean"    "l5no2tmean"   "l5so2tmean"   "l5o3tmean"
[31] "l6pm10tmean"  "l6pm25tmean"  "l6cotmean"    "l6no2tmean"   "l6so2tmean"
[36] "l6o3tmean"    "l7pm10tmean"  "l7pm25tmean"  "l7cotmean"    "l7no2tmean"
[41] "l7so2tmean"   "l7o3tmean"    "lm1pm10tmean" "lm1pm25tmean" "lm1cotmean"
[46] "lm1no2tmean"  "lm1so2tmean"  "lm1o3tmean"   "lm2pm10tmean" "lm2pm25tmean"
[51] "lm2cotmean"   "lm2no2tmean"  "lm2so2tmean"  "lm2o3tmean"   "lm3pm10tmean"
[56] "lm3pm25tmean" "lm3cotmean"   "lm3no2tmean"  "lm3so2tmean"  "lm3o3tmean"
[61] "lm4pm10tmean" "lm4pm25tmean" "lm4cotmean"   "lm4no2tmean"  "lm4so2tmean"
[66] "lm4o3tmean"   "lm5pm10tmean" "lm5pm25tmean" "lm5cotmean"   "lm5no2tmean"
[71] "lm5so2tmean"  "lm5o3tmean"   "lm6pm10tmean" "lm6pm25tmean" "lm6cotmean"
[76] "lm6no2tmean"  "lm6so2tmean"  "lm6o3tmean"   "lm7pm10tmean" "lm7pm25tmean"
[81] "lm7cotmean"   "lm7no2tmean"  "lm7so2tmean"  "lm7o3tmean"
>
which is correct AFAICS and shouldn't be shortened to only the 13 elements
> grep("^l\\w+tmean", x, perl = TRUE, value = TRUE)
 [1] "l1pm10tmean" "l1pm25tmean" "l1cotmean"   "l1no2tmean"  "l1so2tmean"
 [6] "l1o3tmean"   "l2pm10tmean" "l2pm25tmean" "l2cotmean"   "l2no2tmean"
[11] "l2so2tmean"  "l2o3tmean"   "l3pm10tmean"
in R-patched.
------------
For me,  'man perlre' contains
\w Match a "word" character (alphanumeric plus "_")
         <......>
   A "\w" matches a single alphanumeric character or "_", not a whole
   word.  Use "\w+" to match a string of Perl-identifier characters (which
   isn't the same as matching an English word).  If "use locale" is in
   effect, the list of alphabetic characters generated by "\w" is taken
   from the current locale.  See the perllocale manpage. .......
so it may well be connected to locale problems. But I don't think any locale should let '^l\w+tmean' match "l2pm25tmean" and yet fail to match "lm5pm25tmean".
[If making a difference between these two, it should rather be
 the other way round].
Martin Maechler
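
To make the perlre description of "\w" concrete, here is a small sketch of what the pattern should and should not match. The candidate strings other than the two names quoted above are made up for illustration, and the comparison against the explicit class [A-Za-z0-9_] assumes plain ASCII input, where the two forms should agree.

```r
# Two names from the thread plus hypothetical edge cases.
candidates <- c("l2pm25tmean",   # from the thread: should match
                "lm5pm25tmean",  # from the thread: should match as well
                "l_x_tmean",     # "_" counts as a word character
                "l-2tmean",      # "-" is not a word character
                "ltmean",        # nothing between "l" and "tmean"
                "xl1cotmean")    # does not start with "l"

# "\w+" needs at least one alphanumeric or "_" between "l" and "tmean".
matched <- grepl("^l\\w+tmean", candidates, perl = TRUE)

# The same test with the character class written out (ASCII assumption).
explicit <- grepl("^l[A-Za-z0-9_]+tmean", candidates, perl = TRUE)

print(data.frame(candidates, matched, explicit))
```

On a correctly working build, the first three rows are TRUE and the last three FALSE in both columns, so the "l" and "lm" names are treated alike.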


______________________________________________
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-devel
