To make matters a little more interesting, I get some weird behavior on R 1.9.0 also. For example, when I run

x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))

and then run

d <- replicate(1000, length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE)))

> summary(d)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  13.00   13.00   13.00   30.47   13.00   84.00

Similar behavior on both R 1.9.0 and today's R-patched (I'm running on Linux). To me this smells like a memory issue in PCRE.
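For anyone who wants to try this without fetching the remote file, here is a rough sketch. It assumes a stand-in for x rebuilt from the visible pattern of the 84 names in the 1.9.0 listing quoted further down (the real names.R holds more entries), and it uses paste0(), which is only available in much later versions of R than the ones discussed here. A deterministic grep() should give a single count across all replications.

```r
# Hypothetical stand-in for x: the 84 "...tmean" names from the 1.9.0
# listing, rebuilt from their pattern. This is NOT the real names.R.
lags <- c(paste0("l", 1:7), paste0("lm", 1:7))
poll <- c("pm10", "pm25", "co", "no2", "so2", "o3")
x <- paste0(rep(lags, each = length(poll)), poll, "tmean")

# Count the matches once per replication, as in the original test.
count_matches <- function() {
  length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE))
}
d <- replicate(1000, count_matches())

# One distinct value in the table means grep() is behaving consistently;
# several distinct values would reproduce the problem described above.
print(table(d))
```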

-roger


Martin Maechler wrote:
"Roger" == Roger D Peng <[EMAIL PROTECTED]>
   on Fri, 11 Jun 2004 10:43:57 -0400 writes:


Roger> I've noticed a change in the way grep() behaves between the 1.9.0
Roger> release and a recent R-patched. On 1.9.0 I get the following output:

    >> x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
    >> length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE))
    Roger> [1] 84

    Roger> And on R-patched (2004-06-11) I get

    >> x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
    >> length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE))
    Roger> [1] 13

I can reproduce this exactly.

    <....>

Roger> I didn't find anything in the NEWS file that would indicate a change

yes: The src/extras/pcre/ (Perl Compatible Regular Expressions)
     library was upgraded, and since we assumed that wouldn't
     have any effect --- as we now see, too optimistically ---
     it wasn't documented in NEWS

Roger> and another problem is that I'm not sure which behavior is correct.
Roger> My knowledge of regular expressions is limited.

The first one is correct I think: '\w' means word constituents
(see below) and for 1.9.0, you get


> grep("^l\\w+tmean", x, perl = TRUE, value = TRUE)
 [1] "l1pm10tmean"  "l1pm25tmean"  "l1cotmean"    "l1no2tmean"   "l1so2tmean"
 [6] "l1o3tmean"    "l2pm10tmean"  "l2pm25tmean"  "l2cotmean"    "l2no2tmean"
[11] "l2so2tmean"   "l2o3tmean"    "l3pm10tmean"  "l3pm25tmean"  "l3cotmean"
[16] "l3no2tmean"   "l3so2tmean"   "l3o3tmean"    "l4pm10tmean"  "l4pm25tmean"
[21] "l4cotmean"    "l4no2tmean"   "l4so2tmean"   "l4o3tmean"    "l5pm10tmean"
[26] "l5pm25tmean"  "l5cotmean"    "l5no2tmean"   "l5so2tmean"   "l5o3tmean"
[31] "l6pm10tmean"  "l6pm25tmean"  "l6cotmean"    "l6no2tmean"   "l6so2tmean"
[36] "l6o3tmean"    "l7pm10tmean"  "l7pm25tmean"  "l7cotmean"    "l7no2tmean"
[41] "l7so2tmean"   "l7o3tmean"    "lm1pm10tmean" "lm1pm25tmean" "lm1cotmean"
[46] "lm1no2tmean"  "lm1so2tmean"  "lm1o3tmean"   "lm2pm10tmean" "lm2pm25tmean"
[51] "lm2cotmean"   "lm2no2tmean"  "lm2so2tmean"  "lm2o3tmean"   "lm3pm10tmean"
[56] "lm3pm25tmean" "lm3cotmean"   "lm3no2tmean"  "lm3so2tmean"  "lm3o3tmean"
[61] "lm4pm10tmean" "lm4pm25tmean" "lm4cotmean"   "lm4no2tmean"  "lm4so2tmean"
[66] "lm4o3tmean"   "lm5pm10tmean" "lm5pm25tmean" "lm5cotmean"   "lm5no2tmean"
[71] "lm5so2tmean"  "lm5o3tmean"   "lm6pm10tmean" "lm6pm25tmean" "lm6cotmean"
[76] "lm6no2tmean"  "lm6so2tmean"  "lm6o3tmean"   "lm7pm10tmean" "lm7pm25tmean"
[81] "lm7cotmean"   "lm7no2tmean"  "lm7so2tmean"  "lm7o3tmean"


which is correct AFAICS and shouldn't be shortened to only the 13 elements


grep("^l\\w+tmean", x, perl = TRUE, value = TRUE)

 [1] "l1pm10tmean" "l1pm25tmean" "l1cotmean"   "l1no2tmean"  "l1so2tmean"
 [6] "l1o3tmean"   "l2pm10tmean" "l2pm25tmean" "l2cotmean"   "l2no2tmean"
[11] "l2so2tmean"  "l2o3tmean"   "l3pm10tmean"


in R-patched.

------------

For me,  'man perlre' contains


\w Match a "word" character (alphanumeric plus "_")


         <......>

   A "\w" matches a single alphanumeric character or "_", not a whole
   word.  Use "\w+" to match a string of Perl-identifier characters (which
   isn't the same as matching an English word).  If "use locale" is in
   effect, the list of alphabetic characters generated by "\w" is taken
   from the current locale.  See the perllocale manpage. .......


so it may well be connected to locale problems. But I don't
think any locale should have '^l\w+tmean' match "l2pm25tmean"
but not match "lm5pm25tmean".


[If making a difference between these two, it should rather be
 the other way round].
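A small check of that expectation, as a sketch only: the vector nm is made up for illustration from two names in the listings above plus one name that should not match. It compares '\w+' with an explicit character class, so the result does not depend on how '\w' is interpreted.

```r
# Two names from the listings above, plus one that lacks the leading "l"
# (the vector itself is hypothetical, only for illustration).
nm <- c("l2pm25tmean", "lm5pm25tmean", "tmean")

# \w+ is one or more of [A-Za-z0-9_] in the C locale, so both "l..."
# names should be matched by a correctly working PCRE.
hits_w <- grep("^l\\w+tmean", nm, perl = TRUE, value = TRUE)
print(hits_w)

# Same pattern with the class spelled out, as a cross-check.
hits_class <- grep("^l[A-Za-z0-9_]+tmean", nm, perl = TRUE, value = TRUE)
print(identical(hits_w, hits_class))
```

If the two results differ, the problem lies in how '\w' is handled rather than in the rest of the pattern.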

Martin Maechler




______________________________________________ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-devel
