Re: [Rd] Inefficiency in df$col

Duncan Murdoch Sun, 03 Feb 2019 18:36:09 -0800

On 03/02/2019 12:04 p.m., Radford Neal wrote:

While doing some performance testing with the new version of pqR (see
pqR-project.org), I've encountered an extreme, and quite unnecessary,
inefficiency in the current R Core implementation of R, which I think
you might want to correct.


The inefficiency is in access to columns of a data frame, as in
expressions such as df$col[i], which I think are very common (the
alternatives of df[i,"col"] and df[["col"]][i] are, I think, less
common).

Here is the setup for an example showing the issue:

   L <- list (abc=1:9, xyz=11:19)
   Lc <- L; class(Lc) <- "glub"
   df <- data.frame(L)

And here are some times for R-3.5.2 (r-devel of 2019-02-01 is much
the same):

   > system.time (for (i in 1:1000000) r <- L$xyz)
      user  system elapsed
     0.086   0.004   0.089
   > system.time (for (i in 1:1000000) r <- Lc$xyz)
      user  system elapsed
     0.494   0.000   0.495
   > system.time (for (i in 1:1000000) r <- df$xyz)
      user  system elapsed
     3.425   0.000   3.426

So accessing a column of a data frame is 38 times slower than
accessing a list element (which is what happens in the underlying
implementation of a data frame), and 7 times slower than accessing an
element of a list with a class attribute (for which it's necessary to
check whether there is a $.glub method, which there isn't here).
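
A minimal sketch of how one might check which of these objects has an
R-level `$` method, reusing the setup above. The variable names
glub_method, df_method and same_value are illustrative only, and whether
`$.data.frame` is found depends on the R version in use:

```r
# Setup copied from the example above.
L  <- list(abc = 1:9, xyz = 11:19)
Lc <- L; class(Lc) <- "glub"
df <- data.frame(L)

# optional = TRUE makes getS3method() return NULL instead of signalling
# an error when no method is registered.
glub_method <- getS3method("$", "glub", optional = TRUE)
df_method   <- getS3method("$", "data.frame", optional = TRUE)

# No `$.glub` is defined here, so Lc$xyz ends up in the primitive `$`
# once the method lookup has failed.
print(is.null(glub_method))

# Whether an R-level `$.data.frame` exists depends on the R version.
print(is.function(df_method))

# All three access paths return the same underlying list element.
same_value <- identical(L$xyz, Lc$xyz) && identical(L$xyz, df$xyz)
print(same_value)
```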

For comparison, here are the times for pqR-2019-01-25:

   > system.time (for (i in 1:1000000) r <- L$xyz)
      user  system elapsed
     0.057   0.000   0.058
   > system.time (for (i in 1:1000000) r <- Lc$xyz)
      user  system elapsed
     0.251   0.000   0.251
   > system.time (for (i in 1:1000000) r <- df$xyz)
      user  system elapsed
     0.247   0.000   0.247

So when accessing df$xyz, R-3.5.2 is 14 times slower than pqR-2019-01-25.
(For a partial match, like df$xy, R-3.5.2 is 34 times slower.)

I wasn't surprised that pqR was faster, but I didn't expect this big a
difference.  Then I remembered having seen a NEWS item from R-3.1.0:

   * Partial matching when using the $ operator _on data frames_ now
     throws a warning and may become defunct in the future. If partial
     matching is intended, replace foo$bar by foo[["bar", exact =
     FALSE]].

and having looked at the code then:

   `$.data.frame` <- function(x,name) {
     a <- x[[name]]
     if (!is.null(a)) return(a)

     a <- x[[name, exact=FALSE]]
     if (!is.null(a)) warning("Name partially matched in data frame")
     return(a)
   }

I recall thinking at the time that this involved a pretty big
performance hit, compared to letting the primitive $ operator do it,
just to produce a warning.  But it wasn't until now that I noticed
this NEWS in R-3.1.1:

   * The warning when using partial matching with the $ operator on
     data frames is now only given when
     options("warnPartialMatchDollar") is TRUE.

for which the code was changed to:

   `$.data.frame` <- function(x,name) {
     a <- x[[name]]
     if (!is.null(a)) return(a)

     a <- x[[name, exact=FALSE]]
     if (!is.null(a) && getOption("warnPartialMatchDollar", default=FALSE)) {
           names <- names(x)
           warning(gettextf("Partial match of '%s' to '%s' in data frame",
                                      name, names[pmatch(name, names)]))
     }
     return(a)
   }

One can see the effect now when warnPartialMatchDollar is enabled:

   > options(warnPartialMatchDollar=TRUE)
   > Lc$xy
   [1] 11 12 13 14 15 16 17 18 19
   Warning message:
   In Lc$xy : partial match of 'xy' to 'xyz'
   > df$xy
   [1] 11 12 13 14 15 16 17 18 19
   Warning message:
   In `$.data.frame`(df, xy) : Partial match of 'xy' to 'xyz' in data frame

So the only thing that slowing down accesses like df$xyz by a factor of
seven achieves now is to add the words "in data frame" to the warning
message (while making the earlier part of the message less intelligible).

I think you might want to just delete the definition of $.data.frame,
reverting to the situation before R-3.1.0.

I imagine the cause is that the list version is done in C code rather
than R code (i.e. there's no R function `$.list`). So an alternative
solution would be to also implement `$.data.frame` in the underlying C
code. This won't be quite as fast (it needs that test for NULL), but
should be close in the full match case.
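
A rough sketch of that point, not a proposed patch: `$` is a primitive,
no R-level `$.list` is registered, and the documented base function
.subset2() reaches the same column without any method dispatch. The
variable names below are illustrative only:

```r
df <- data.frame(abc = 1:9, xyz = 11:19)

# `$` is implemented in C as a primitive, not as an R closure.
dollar_is_primitive <- is.primitive(`$`)

# There is no R-level `$.list`; optional = TRUE returns NULL when the
# method is absent instead of raising an error.
no_list_method <- is.null(getS3method("$", "list", optional = TRUE))

# .subset2() behaves like `[[` but skips method dispatch, so it never
# calls a data frame method written in R.
col_direct  <- .subset2(df, "xyz")
same_column <- identical(col_direct, df$xyz) &&
               identical(col_direct, df[["xyz"]])

print(c(dollar_is_primitive, no_list_method, same_column))
```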


Duncan Murdoch

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
