With the change to 'factor',
factor(1L, levels = TRUE)
doesn't give NA, different from
factor(1, levels = TRUE)
With the change to 'factor',
factor(TRUE, levels = 1L)
and
factor(TRUE, levels = 1)
don't give NA.
With the change to 'factor',
factor(2L, levels = sqrt(2)^2)
gives NA, different from
factor(2, levels = sqrt(2)^2)
With the change to 'factor',
factor(2L, exclude = sqrt(2)^2)
has 1 level (nothing is excluded), different from
factor(2, exclude = sqrt(2)^2)
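The differences listed above seem to come down to how match() coerces its two arguments to a common type. A minimal sketch that runs in current R (no patch needed) and uses only base functions; the variable name `lv` is just for illustration:

```r
lv <- sqrt(2)^2           # slightly larger than 2 in floating point

lv == 2                   # FALSE: not exactly 2
as.character(lv)          # "2": at most 15 significant digits are kept

## match() coerces both arguments to the "later" of the two types
match(2L, lv)             # integer vs double: compared as doubles, no match
match("2", lv)            # character vs double: compared as strings, match

match(1L, TRUE)           # TRUE is coerced to 1L, match
match("1", TRUE)          # compared as "1" vs "TRUE", no match
```

So whether 'x' is converted with as.character() before match() decides which of these comparisons takes place.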
------------
On 21.03.25 at 15:42, Aidan Lakshman via R-devel wrote:
After investigating the source of table, I ended up on the reason being
“as.character()”:
This is specifically happening within the conversion of the input to type
factor, which is where the as.character conversion happens.
Yes, I also think 'factor' could do a bit better for unclassed integers
(such as when called from 'cut') as well as for logical input (such as
from 'summary' -> 'table').
Note that 'as.factor' already has a "fast track" for plain integers
(originally for 'split.default' from 'tapply'), so can be used instead
of 'factor' when there is no need for custom 'levels', 'labels', or
'exclude'. (Thanks for already mentioning 'tabulate'.)
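As a small illustration of the 'as.factor' remark (a sketch only, with made-up data), the two calls should agree for plain integer input when no custom 'levels', 'labels', or 'exclude' are needed:

```r
set.seed(1)
i <- sample(c(3L, 1L, 2L), 20, replace = TRUE)   # plain (unclassed) integers

f1 <- as.factor(i)   # integer fast track
f2 <- factor(i)      # general path (uses as.character() in unpatched R)

## same levels and same integer codes either way
stopifnot(identical(levels(f1), levels(f2)),
          identical(as.integer(f1), as.integer(f2)))
```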
A 'factor' patch would apply more broadly, e.g.:
===================================================================
--- src/library/base/R/factor.R (Revision 88042)
+++ src/library/base/R/factor.R (Arbeitskopie)
@@ -20,14 +20,18 @@
                    exclude = NA, ordered = is.ordered(x), nmax = NA)
 {
     if(is.null(x)) x <- character()
+    directmatch <- !is.object(x) &&
+        (is.character(x) || is.integer(x) || is.logical(x))
     nx <- names(x)
     if (missing(levels)) {
         y <- unique(x, nmax = nmax)
         ind <- order(y)
-        levels <- unique(as.character(y)[ind])
+        if (!directmatch)
+            y <- as.character(y)
+        levels <- unique(y[ind])
     }
     force(ordered) # check if original x is an ordered factor
-    if(!is.character(x))
+    if(!directmatch)
         x <- as.character(x)
     ## levels could be a long vector, but match will not handle that.
     levels <- levels[is.na(match(levels, exclude))]
     f <- match(x, levels)
===================================================================
This skips as.character() also for integer/logical 'x' and would indeed
bring table() runtimes "in order":
set.seed(1)
C <- sample(c("no", "yes"), 10^7, replace = TRUE)
F <- as.factor(C)
L <- F == "yes"
I <- as.integer(L)
N <- as.numeric(I)
## Median system.time(table(.)) in ms:
## table(F) 256
## table(I) 384 # not 696
## table(L) 409 # not 1159
## table(C) 591
## table(N) 3324
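To see which inputs would take the direct route, the condition from the patch can be tried on its own. This is only a sketch: `is_directmatch` is a hypothetical helper name that copies the test, it is not part of base R.

```r
## Hypothetical helper mirroring the 'directmatch' condition in the patch
is_directmatch <- function(x)
    !is.object(x) && (is.character(x) || is.integer(x) || is.logical(x))

is_directmatch(c(TRUE, NA))     # TRUE
is_directmatch(1:3)             # TRUE
is_directmatch(c(1, 0))         # FALSE: doubles still go through as.character()
is_directmatch(factor("a"))     # FALSE: classed object
is_directmatch(Sys.Date())      # FALSE: classed object
```

This matches the timings above: only table(N), the double case, keeps paying for the character conversion.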
The (seemingly) small patch passes check-all, but maybe it overlooks
some edge cases. I'd test it on a subset of CRAN/BIOC packages.
Best,
Sebastian Meyer
# Timing is all on my local machine (OSX)
N_v <- sample(c(1,0), 10^7, replace = TRUE)
L_v <- sample(c(TRUE, FALSE), 10^7, replace = TRUE)
# user system elapsed
system.time(table(N_v)) # 2.155 0.039 2.192
system.time(table(L_v)) # 0.806 0.030 0.838
system.time(N_fv <- as.factor(N_v)) # 2.026 0.024 2.050
system.time(L_fv <- as.factor(L_v)) # 0.668 0.015 0.683
system.time(table(N_fv)) # 0.133 0.022 0.156
system.time(table(L_fv)) # 0.134 0.018 0.151
The performance for integers and especially booleans is quite surprising.
Of note is that the performance is significantly better if using `tabulate`,
since this doesn't involve a conversion to factor (though input must be
numeric/factor, results aren't named, and it has worse handling of NA values).
If you have performance critical calls like this you could consider using
`tabulate` instead.
system.time(tabulate(N_v)) # 0.054 0.002 0.056
system.time(tabulate(as.integer(L_v))) # 0.052 0.002 0.055
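One caveat when swapping in `tabulate` for 0/1 or logical data: it only counts positive integer bins, so zeros are silently dropped. A small sketch with made-up data showing one way to count both values (the shift by one and the names are my own choice here):

```r
x <- c(1, 0, 1, 1, 0)

tabulate(x)                             # only the three 1s are counted

counts <- tabulate(x + 1L, nbins = 2L)  # shift so that 0 -> bin 1, 1 -> bin 2
names(counts) <- c("0", "1")
counts
```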
I don't know if this is a known issue or not; most of my colleagues are aware
of the slow-down and use `tabulate` when performance is required. My
understanding was that the slower performance is a trade-off for more
consistent behavior (better output, better handling of ambiguities/NA,
etc.), and that speed isn't the highest priority with `table`. Maybe someone
else has a better understanding of the history of the function.
As for improving the speed, it would basically come down to refactoring `table`
to not use a `factor` conversion. I'd be concerned about introducing a lot of
edge cases with that, but it's theoretically possible. Based on 30 seconds of
thinking, it may be possible to do something like:
## just a sketch of a barebones non-factor implementation
test_tab <- function(x){
  lookup <- unique(x)
  counts <- tabulate(match(x, lookup))
  names(counts) <- as.character(lookup)
  counts
}
system.time(test_tab(L_v)) # 0.101 0.006 0.107
system.time(test_tab(N_v)) # 0.129 0.015 0.144
This is also faster in the case where there are lots of categories with few
entries per category:
N_v2 <- 1:1e7
system.time(test_tab(N_v2)) # 0.383 0.024 0.411
system.time(table(N_v2)) # 6.122 0.228 6.398
Obviously there are some big shortcomings:
- it's missing a lot of error checking etc. that the standard `table` has
- it only works with 1D vectors
- NA handling isn't quite the same as `table` (though it would be easy to adapt)
Just including to potentially start discussion for optimization.
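On the NA point, a possible adaptation of the sketch might look like the following. This is an untested idea rather than a proposal: the name `test_tab_na` and the `useNA` argument are hypothetical, and it still only handles 1D vectors.

```r
## Sketch: sorted categories, NA dropped by default as table() does
test_tab_na <- function(x, useNA = FALSE) {
  ## na.last = NA removes NA from the lookup, TRUE keeps it at the end
  lookup <- sort(unique(x), na.last = if (useNA) TRUE else NA)
  ## match() returns NA for values not in lookup; tabulate() ignores NA
  counts <- tabulate(match(x, lookup), nbins = length(lookup))
  names(counts) <- as.character(lookup)
  counts
}

x <- c(TRUE, FALSE, NA, TRUE)
test_tab_na(x)                 # counts for FALSE and TRUE only
test_tab_na(x, useNA = TRUE)   # adds a third count for NA
```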
For reference, the relevant section is in src/library/base/R/table.R:L75-85
-Aidan
-----------------------
Aidan Lakshman (he/him)
http://www.ahl27.com/
On 21 Mar 2025, at 8:26, Karolis Koncevičius wrote:
I was calling table() on some long logical vectors and noticed that it took a
long time.
Out of curiosity I checked the performance of table() on different types, and
had some unexpected results:
C <- sample(c("yes", "no"), 10^7, replace = TRUE)
F <- factor(sample(c("yes", "no"), 10^7, replace = TRUE))
N <- sample(c(1,0), 10^7, replace = TRUE)
I <- sample(c(1L,0L), 10^7, replace = TRUE)
L <- sample(c(TRUE, FALSE), 10^7, replace = TRUE)
# ordered by execution time
# user system elapsed
system.time(table(F)) # 0.088 0.006 0.093
system.time(table(C)) # 0.208 0.017 0.224
system.time(table(I)) # 0.242 0.019 0.261
system.time(table(L)) # 0.665 0.015 0.680
system.time(table(N)) # 1.771 0.019 1.791
The performance for integers and especially booleans is quite surprising.
After investigating the source of table, I ended up on the reason being
“as.character()”:
system.time(as.character(L))
user system elapsed
0.461 0.002 0.462
Even a manual conversion can achieve a speed-up by a factor of ~7:
system.time(c("FALSE", "TRUE")[L+1])
user system elapsed
0.061 0.006 0.067
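A quick check on a small vector (made-up data) suggests the manual conversion gives the same result as as.character(), including for NA:

```r
L <- c(TRUE, FALSE, NA, TRUE)

manual <- c("FALSE", "TRUE")[L + 1L]   # NA index yields NA_character_
identical(manual, as.character(L))
```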
Tested on 4.4.3 as well as devel trunk.
Just reporting for comments and attention.
Karolis K.
______________________________________________
R-devel using r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel