[I originally posted this on the R-help mailing list, and it was suggested that R-devel would be a better place to discuss it.]

Running ‘table’ on a factor with levels containing non-ASCII characters seems to result in extremely bad performance on Windows. Here’s a simple example with benchmark results (I’ve reduced the number of replications to make the function finish within a reasonable time):

library(rbenchmark)

x.num = sample(1:2, 10^5, replace=TRUE)
x.fac.ascii = factor(x.num, levels=1:2, labels=c("A","B"))
x.fac.nascii = factor(x.num, levels=1:2, labels=c("Æ","Ø"))

benchmark(
  table(x.num),
  table(x.fac.ascii),
  table(x.fac.nascii),
  table(unclass(x.fac.nascii)),
  replications=20
)

                          test replications elapsed   relative user.self sys.self user.child sys.child
4 table(unclass(x.fac.nascii))           20    1.53   4.636364      1.51     0.01         NA        NA
2           table(x.fac.ascii)           20    0.33   1.000000      0.33     0.00         NA        NA
3          table(x.fac.nascii)           20  146.67 444.454545     38.52    81.74         NA        NA
1                 table(x.num)           20    1.55   4.696970      1.53     0.01         NA        NA

sessionInfo()

R version 2.12.1 (2010-12-16)
Platform: i386-pc-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=Norwegian-Nynorsk_Norway.1252  LC_CTYPE=Norwegian-Nynorsk_Norway.1252
[3] LC_MONETARY=Norwegian-Nynorsk_Norway.1252
[4] LC_NUMERIC=C                              LC_TIME=Norwegian-Nynorsk_Norway.1252

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base

other attached packages:
[1] rbenchmark_0.3

The timings are from R 2.12.1, but I also get comparable results on the latest prerelease (R 2.13.0 2011-01-18 r54032).

Running the same test (100 replications) on a Linux system with R 2.12.1 Patched results in essentially no difference between the performance on ASCII factors and non-ASCII factors:

                          test replications elapsed relative user.self sys.self user.child sys.child
4 table(unclass(x.fac.nascii))          100   4.607 3.096102     4.455    0.092          0         0
2           table(x.fac.ascii)          100   1.488 1.000000     1.459    0.028          0         0
3          table(x.fac.nascii)          100   1.616 1.086022     1.560    0.051          0         0
1                 table(x.num)          100   4.504 3.026882     4.403    0.079          0         0

sessionInfo()

R version 2.12.1 Patched (2011-01-18 r54033)
Platform: i686-pc-linux-gnu (32-bit)

locale:
 [1] LC_CTYPE=nn_NO.UTF-8       LC_NUMERIC=C               LC_TIME=nn_NO.UTF-8
 [4] LC_COLLATE=nn_NO.UTF-8     LC_MONETARY=C              LC_MESSAGES=nn_NO.UTF-8
 [7] LC_PAPER=nn_NO.UTF-8       LC_NAME=C                  LC_ADDRESS=C
[10] LC_TELEPHONE=C             LC_MEASUREMENT=nn_NO.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] rbenchmark_0.3

Profiling the ‘table’ function indicates that almost all the time is spent in the ‘match’ function, which is called when ‘factor’ is applied to a ‘factor’ inside ‘table’. Indeed, ‘x.fac.nascii = factor(x.fac.nascii)’ by itself is extremely slow.

Is there any theoretical reason why ‘factor’ on a ‘factor’ with non-ASCII levels must be so slow? And why doesn’t this happen on Linux?

Perhaps a fix for ‘table’ might be to calculate the ‘table’ statistics *including* all levels (not using the ‘factor’ function anywhere), and then remove the ‘exclude’ levels at the end. For example, something along these lines:

# Placeholder for a version of 'table' that never calls 'factor'
res = table.modified.to.not.use.factor(...)
# For each dimension, flag the levels that are not in 'exclude'
ind = lapply(dimnames(res), function(x) !(x %in% exclude))
do.call("[", c(list(res), ind, drop=FALSE))

(I haven’t tested this very much, so there may be issues with this way of doing things.)

-- 
Karl Ove Hufthammer

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel