x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))

and then run

d <- replicate(1000, length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE)))

> summary(d)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  13.00   13.00   13.00   30.47   13.00   84.00

Similar behavior on both R 1.9.0 and today's R-patched (I'm running on Linux). To me this smells like a memory issue in PCRE.
-roger
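
A self-contained way to repeat the check above without downloading names.R, as a rough sketch only. It assumes a stand-in vector (hypothetically named x_local) built from the 84 matching names quoted in this thread plus three made-up non-matching fillers; the real names.R contents are not reproduced here. On a correctly working PCRE, every replicate should return the same count, and the POSIX-class pattern gives an independent cross-check.

```r
# Stand-in for names.R (assumption): the 84 matching names seen in this
# thread, plus three invented fillers that must not match.
lags       <- c(paste0("l", 1:7), paste0("lm", 1:7))
pollutants <- c("pm10", "pm25", "co", "no2", "so2", "o3")
matching   <- as.vector(outer(pollutants, lags,
                              function(p, l) paste0(l, p, "tmean")))
x_local    <- c("date", "tmean", "death", matching)

# Repeat the PCRE match many times and tabulate the counts; more than
# one distinct value in the table means the result varies between runs.
pattern <- "^l\\w+tmean"
d <- replicate(1000, length(grep(pattern, x_local, perl = TRUE, value = TRUE)))
print(table(d))

# Cross-check with a POSIX bracket expression instead of \w,
# using the default (non-PCRE) regex engine.
n_posix <- length(grep("^l[[:alnum:]_]+tmean", x_local))
print(n_posix)
```

If table(d) shows both 13 and 84 while n_posix stays at 84, that points at the PCRE code path rather than at the pattern or the data.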
Martin Maechler wrote:
"Roger" == Roger D Peng <[EMAIL PROTECTED]> on Fri, 11 Jun 2004 10:43:57 -0400 writes:
Roger> I've noticed a change in the way grep() behaves between the 1.9.0
Roger> release and a recent R-patched. On 1.9.0 I get the following output:

>> x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
>> length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE))
Roger> [1] 84

Roger> And on R-patched (2004-06-11) I get

>> x <- dget(file = url("http://www.biostat.jhsph.edu/~rpeng/names.R"))
>> length(grep("^l\\w+tmean", x, perl = TRUE, value = TRUE))
Roger> [1] 13

I can reproduce this exactly.
<....>
Roger> I didn't find anything in the NEWS file that would indicate a change
yes: The src/extras/pcre/ (Perl Compatible Regular Expressions) library was upgraded, and since we assumed (too optimistically, as we now see) that this wouldn't have any effect, it wasn't documented in NEWS.
Roger> and another problem is that I'm not sure which behavior is correct.
Roger> My knowledge of regular expressions is limited.
The first one is correct I think: '\w' means word constituents
(see below) and for 1.9.0, you get
> grep("^l\\w+tmean", x, perl = TRUE, value = TRUE)
 [1] "l1pm10tmean"  "l1pm25tmean"  "l1cotmean"    "l1no2tmean"   "l1so2tmean"
 [6] "l1o3tmean"    "l2pm10tmean"  "l2pm25tmean"  "l2cotmean"    "l2no2tmean"
[11] "l2so2tmean"   "l2o3tmean"    "l3pm10tmean"  "l3pm25tmean"  "l3cotmean"
[16] "l3no2tmean"   "l3so2tmean"   "l3o3tmean"    "l4pm10tmean"  "l4pm25tmean"
[21] "l4cotmean"    "l4no2tmean"   "l4so2tmean"   "l4o3tmean"    "l5pm10tmean"
[26] "l5pm25tmean"  "l5cotmean"    "l5no2tmean"   "l5so2tmean"   "l5o3tmean"
[31] "l6pm10tmean"  "l6pm25tmean"  "l6cotmean"    "l6no2tmean"   "l6so2tmean"
[36] "l6o3tmean"    "l7pm10tmean"  "l7pm25tmean"  "l7cotmean"    "l7no2tmean"
[41] "l7so2tmean"   "l7o3tmean"    "lm1pm10tmean" "lm1pm25tmean" "lm1cotmean"
[46] "lm1no2tmean"  "lm1so2tmean"  "lm1o3tmean"   "lm2pm10tmean" "lm2pm25tmean"
[51] "lm2cotmean"   "lm2no2tmean"  "lm2so2tmean"  "lm2o3tmean"   "lm3pm10tmean"
[56] "lm3pm25tmean" "lm3cotmean"   "lm3no2tmean"  "lm3so2tmean"  "lm3o3tmean"
[61] "lm4pm10tmean" "lm4pm25tmean" "lm4cotmean"   "lm4no2tmean"  "lm4so2tmean"
[66] "lm4o3tmean"   "lm5pm10tmean" "lm5pm25tmean" "lm5cotmean"   "lm5no2tmean"
[71] "lm5so2tmean"  "lm5o3tmean"   "lm6pm10tmean" "lm6pm25tmean" "lm6cotmean"
[76] "lm6no2tmean"  "lm6so2tmean"  "lm6o3tmean"   "lm7pm10tmean" "lm7pm25tmean"
[81] "lm7cotmean"   "lm7no2tmean"  "lm7so2tmean"  "lm7o3tmean"

which is correct AFAICS and shouldn't be shortened to only the 13 elements

> grep("^l\\w+tmean", x, perl = TRUE, value = TRUE)
 [1] "l1pm10tmean" "l1pm25tmean" "l1cotmean"   "l1no2tmean"  "l1so2tmean"
 [6] "l1o3tmean"   "l2pm10tmean" "l2pm25tmean" "l2cotmean"   "l2no2tmean"
[11] "l2so2tmean"  "l2o3tmean"   "l3pm10tmean"

in R-patched.
------------
For me, 'man perlre' contains
\w Match a "word" character (alphanumeric plus "_")
<......>
A "\w" matches a single alphanumeric character or "_", not a whole word. Use "\w+" to match a string of Perl-identifier characters (which isn't the same as matching an English word). If "use locale" is in effect, the list of alphabetic characters generated by "\w" is taken from the current locale. See the perllocale manpage. .......
so it may well be connected to locale problems. But I don't
think any locale should let '^l\w+tmean' match "l2pm25tmean" but not
"lm5pm25tmean".
[If making a difference between these two, it should rather be the other way round].
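
The point about '\w' can be checked directly on the two names in question. This is only a small sketch: the variable names (both_match, word_char) and the six test characters are chosen here for illustration and do not come from the thread.

```r
pattern <- "^l\\w+tmean"

# Both names consist only of ASCII letters and digits after the leading
# "l", so a correct \w+ should accept each of them.
both_match <- grepl(pattern, c("l2pm25tmean", "lm5pm25tmean"), perl = TRUE)
print(both_match)   # TRUE TRUE

# \w matches a single letter, digit, or "_", but not "-" or a space.
chars     <- c("a", "Z", "5", "_", "-", " ")
word_char <- grepl("^\\w$", chars, perl = TRUE)
print(word_char)    # TRUE TRUE TRUE TRUE FALSE FALSE
```

Because every character involved is plain ASCII, the locale should not change either result.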
Martin Maechler
______________________________________________ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-devel
