(diverted to r-devel, a source code patch attached)

Wacek Kusnierczyk wrote:
> Allan Engelhardt wrote:
>
>> Immaterial, yes, but it is always good to test :) and your solution
>> *is* faster and it is even faster if you can assume byte strings:
>>
>
> :)
>
> indeed; though if the speed is immaterial (and in this case it
> supposedly was), it's probably not worth risking fixed=TRUE removing
> '.tif' from the middle of the name, however unlikely this might be (cf
> murphy's laws).
>
> but if you can assume that each string ends with a '.tif' (or any other
> \..{3} substring), then substr is marginally faster than sub, even as a
> three-pass approach, while avoiding the risk of removing '.tif' from the
> middle:
>
>    strings = sprintf('f:/foo/bar//%s.tif', replicate(1000,
>        paste(sample(letters, 10), collapse='')))
>    library(rbenchmark)
>    benchmark(columns=c('test', 'elapsed'), replications=1000, order=NULL,
>        substr={basenames=basename(strings); substr(basenames, 1,
>            nchar(basenames)-4)},
>        sub=sub('.tif', '', basename(strings), fixed=TRUE, useBytes=TRUE))
>    #     test elapsed
>    # 1 substr   3.176
>    # 2    sub   3.296
>

btw., i wonder why negative indices default to 1 in substr:

    substr('foobar', -5, 5)
    # "fooba"
    # substr('foobar', 1, 5)
    substr('foobar', 2, -2)
    # ""
    # substr('foobar', 2, 1)

this does not seem to be documented in ?substr.

there are ways to make negative indices meaningful, e.g., by taking them
as indexing from behind (as in, e.g., perl):

    # hypothetical
    substr('foobar', -5, 5)
    # "ooba"
    # substr('foobar', 6-5+1, 5)
    substr('foobar', 2, -2)
    # "ooba"
    # substr('foobar', 2, 6-2+1)

there is a trivial fix to src/main/character.c that gives substr the
extended functionality -- see the attached patch. the patch has been
created and tested as follows:

    svn co https://svn.r-project.org/R/trunk r-devel
    cd r-devel
    # modifications made to src/main/character.c
    svn diff > character.c.diff
    svn revert -R .
    patch -p0 < character.c.diff

    ./configure
    make
    make check-all
    # no problems reported

with the patched substr, the original problem can now be solved more
concisely, using a two-pass approach, with performance still better than
the sub/fixed/bytes one, as follows:

    strings = sprintf('f:/foo/bar//%s.tif', replicate(1000,
        paste(sample(letters, 10), collapse='')))
    library(rbenchmark)
    benchmark(columns=c('test', 'elapsed'), replications=1000, order=NULL,
        substr=substr(basename(strings), 1, -5),
        'substr-nchar'={
            basenames=basename(strings)
            substr(basenames, 1, nchar(basenames)-4) },
        sub=sub('.tif', '', basename(strings), fixed=TRUE, useBytes=TRUE))
    #           test elapsed
    # 1       substr   2.981
    # 2 substr-nchar   3.206
    # 3          sub   3.273

if this sounds interesting, i can update the docs accordingly.

vQ
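
p.s. for anyone who wants to try the proposed indexing without rebuilding r, the index translation can be approximated in plain R. this is only a minimal sketch and not the attached patch: substr_neg is a hypothetical name, and it assumes that a negative index -k means the k-th character from the end (i.e., nchar(x) - k + 1), as in the hypothetical examples above. non-negative indices are passed to substr unchanged.

```r
# hypothetical helper, not part of base R and not the C patch:
# emulates "negative indices count from the end" on top of substr
substr_neg <- function(x, start, stop) {
  x <- as.character(x)
  n <- nchar(x)
  # translate -k to n - k + 1, per element of x;
  # indices that are zero or positive are left as they are
  from_end <- function(i) {
    i <- rep_len(i, length(x))
    ifelse(i < 0, n + i + 1, i)
  }
  substr(x, from_end(start), from_end(stop))
}

substr_neg('foobar', -5, 5)
# "ooba"
substr_neg('foobar', 2, -2)
# "ooba"
substr_neg(c('abc.tif', 'hello.tif'), 1, -5)
# "abc" "hello"
```

being an R-level wrapper that calls nchar itself, this says nothing about the timings reported above; it only illustrates the intended semantics.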

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.