On 11/27/05, John Logsdon <[EMAIL PROTECTED]> wrote:
> Hello
>
> I know that R's string functions are not as extensive as those of Unix but

I don't think this statement is true although I have seen it repeated.

> I need to do some text handling totally within an R environment because
> the target is a Windows system which will not have the corresponding shell
> utilities, sed, awk etc.

Free versions of these utilities are available for Windows although they
don't come with Windows. e.g. Google for gawk.

>
> Can anyone explain the following gsub phenomenon to me:
>
> > dates<-c("73","74","02","1973","1974","2002")
>
> I want to take just the last two digits where it is a 4-digit year and
> both digits when it is a 2-digit year. I should be able to use substr but
> measurement from the string end (with a negative counter or something) is
> not implemented:
>
> > substr(dates,3,4)
> [1] "" "" "" "73" "74" "02"
> > substr(dates,-2,4)
> [1] "73" "74" "02" "1973" "1974" "2002"
> > substr(dates,4,-2)
> [1] "" "" "" "" "" ""
>
> So I tried gsub:
>
> > gsub("[19|20]([0-9][0-9])","\\1",dates)
> [1] "73" "74" "02" "973" "974" "002"
>
> As I understand it (and comparing with sed), the \\1 should take the first
> bracketed string but clearly this doesn't work. If I try what should also
> work:
>
> > gsub("[19|20]([0-9])([0-9])","\\1\\2",dates)
> [1] "73" "74" "02" "973" "974" "002"
>
> On the other hand the following does work:
>
> > gsub("[19|20]([0-9])([0-9])","\\2",dates)
> [1] "73" "74" "02" "73" "74" "02"
>
> So it appears that the substitution takes one character extra to the left
> but the following indicates that the lower limit of the selected range is
> also at fault:
>
> > s<-c("1","12","123","1234","12345","123456")
> > gsub("[12]([4-6]*)","",s)
> [1] "" "" "3" "34" "345" "3456"
>
> Probably more elegant examples could be constructed that could home in on
> the issue.
>
> The version is R 2.0.1 on Linux so perhaps it is a little old now.
>
> Questions:
>
> 1) Am I misunderstanding the gsub use?
>
> 2) Was it a bug that has since been corrected?
>
> 3) Is it still a bug in the latest version?
>

It works the same on my system, which is R 2.2.0 Windows patched
(2005-10-24). At first I too thought it was a bug, but I noticed it works
the same in perl, so now I am not sure. The following perl program, run
with perl 5.8.6 on Windows, gives 002 as the answer too:

  $_ = "2002";
  s/[19|20]([0-9])([0-9])/\1\2/g;
  print;

In any case, it could be done like this:

  sub(".*(..)$", "\\1", dates)

or

  substring(dates, nchar(dates)-1)

or the following, which appends -01-01 to the year, converts it to Date
class, implicitly converts it back to character and then extracts the 3rd
and 4th characters of the result:

  substring(as.Date(sprintf("%s-01-01", dates)), 3, 4)
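
A likely explanation for the behaviour discussed above, offered as a sketch and not as part of the original reply: in both R and perl, square brackets form a character class, so "[19|20]" matches exactly one character out of 1, 9, |, 2 and 0. It does not match the prefix "19" or "20". Alternation needs parentheses, "(19|20)", and that group then becomes \\1, which moves the year digits to \\2. The helper last_chars below is hypothetical (it is not a base R function) and only illustrates counting from the end of the string.

```r
dates <- c("73", "74", "02", "1973", "1974", "2002")

# "[19|20]" is a character class: it matches ONE character out of
# '1', '9', '|', '2', '0', so only the first digit of the year is consumed.
class_result <- gsub("[19|20]([0-9][0-9])", "\\1", dates)

# "(19|20)" is a group with alternation: it matches the two-character
# prefix "19" or "20". The year digits are therefore the SECOND group.
# The anchors ^ and $ restrict the match to whole four-digit strings.
group_result <- sub("^(19|20)([0-9][0-9])$", "\\2", dates)

# Hypothetical helper (not part of base R): last n characters of each string.
last_chars <- function(x, n) {
  len <- nchar(x)
  substring(x, pmax(len - n + 1, 1), len)
}
tail_result <- last_chars(dates, 2)

print(class_result)  # "73" "74" "02" "973" "974" "002"
print(group_result)  # "73" "74" "02" "73" "74" "02"
print(tail_result)   # "73" "74" "02" "73" "74" "02"
```

The same reading would account for the second example: in "[12]([4-6]*)" the class removes a single 1 or 2, and "[4-6]*" is allowed to match zero characters, so nothing after the 3 is touched.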