Resending, now that I'm on this list.

---------- Forwarded message ----------
From: Tim Hesterberg <[email protected]>
Date: Tue, May 22, 2012 at 9:58 AM
Subject: Comments on data.table
To: Matthew Dowle <[email protected]>
Cc: [email protected], Chris Neff <[email protected]>

Hi Matthew,

Here are some comments I had about data.table. In the next message I'll forward Chris Neff's response.

Tim Hesterberg

---------- Forwarded message ----------
From: Tim Hesterberg <[email protected]>
Date: Sun, Mar 25, 2012 at 10:13 PM
Subject: Re: [R-users] faster aggregate
To: Chris Neff <[email protected]>

Hi Chris,

Old thread, just responding to you.

I finally started looking seriously at data.table, in response to your posts. I'm thinking about supporting data.table in the aggregate package, and about incorporating one of the nice features you've mentioned into aggregate, namely making it easier to get results for some columns of an existing data.frame (or data.table) without copying.

My preliminary impression is a combination of

(a) Cool!

(b) Nicely implemented; I did benchmarks of memory allocations for regular data.frame code, my dataframe package, data.table, and the combination of dataframe and data.table. dataframe is dramatically better than regular R, data.table is substantially better yet, and the combination of both is slightly better yet.

(c) Sheer horror and frustration. Horror at one dangerous design decision. Frustration that some relatively small changes in the package would make the learning curve much shallower, so this package could be used more widely, and make its use safer.

Take this with a grain of salt - I haven't used the package enough yet, maybe I would change my mind about these points. But I'll share them with you now. I'll give this some time to settle, and try the package more, before sharing these with the author.

(1) Using the second argument to [.data.table for calculating expressions instead of subscripting.

The inconsistency between [.data.table and [.data.frame increases the learning curve dramatically, and makes for bugs. The first argument is also unusual, but in a way that I think makes more sense.

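A minimal sketch of the inconsistency in point (1), assuming the data.table package is installed and that no variables named a or b exist in the calling scope. The tables DF and DT are made up for illustration. It also shows that base R's with() already evaluates an expression inside either kind of table:

```r
library(data.table)

# Hypothetical example data, same shape as the tables used in point (2).
DF <- data.frame(a = c(1, 1, 2), b = c(3, 4, 4))
DT <- as.data.table(DF)

# [.data.table evaluates the second argument as an expression inside the
# table, so the column names are visible as variables.
dt_result <- DT[, a + b]

# [.data.frame treats the second argument as a column subscript, so the
# same call fails here because a and b are not found in the calling scope.
df_result <- tryCatch(DF[, a + b], error = function(e) conditionMessage(e))

# Base R's with() gives expression-style evaluation for both classes.
with_df <- with(DF, a + b)
with_dt <- with(DT, a + b)
```

The same text in the second argument means "evaluate this" for one class and "select these columns" for the other, which is the source of the learning-curve complaint.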
I suggest using a different function for evaluating expressions, in particular,

    with.data.table(x, expr, additional arguments)

Then syntax would be:

    Current                      Using with.data.table
    DT[, expr]                   with(DT, expr)
    DT[K, expr]                  with(DT[K,], expr)  or  with(DT, expr, subset=K)
    DT[, expr, by=foo]           with(DT, expr, by=foo)
    DT[, list(expr1,expr2)]      with(DT, J(expr1, expr2))
    ?not possible now?           with(DT, list(mean(x), quantile(x)))

Note that I would not use list(expr1, expr2), but rather explicitly use data.table(expr1, expr2) or J(expr1, expr2), when someone wants a data.table returned. This makes it easier to look at the code and see what is to be returned.

The inconsistency with normal usage of [ in R also raises questions for [<-.data.table. Is this consistent with [.data.table or [<-.data.frame? (I haven't explored this yet. [<-.data.table is not documented.)

(2) Having setkey modify the object in place.

This means that one cannot look for <- (or =) to determine when an object is modified. It would be safer to instead do

    key(x) <- character vector of key names

As implemented, if you pass a dt to a function and modify the key there, the modification also affects the original object. And you end up with two names that refer to the same object, so modifying one changes the other.

    library(data.table)

    # Test if setkey called within a function causes problems.
    x <- data.table(a=c(1,1,2), b=c(3,4,4), key="a")
    foo <- function(y){
      setkey(y, b)
      y
    }
    z <- foo(x)
    tables() # now x also has key b, not a.
    setkey(z, "a")
    tables() # z and x both have key a

    # Even copying the object without calling a function makes two pointers
    x <- data.table(a=c(1,1,2), b=c(3,4,4), key="a")
    y <- x
    setkey(y, b)
    tables() # both x and y have key "b"

(3) Expecting unquoted names where people would normally expect to give quoted names, like setkey.

(4) Not allowing character data to remain character.

(I deleted earlier messages on the thread. Some of that is relevant, but some of it may be confidential.)

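On point (2), one possible workaround is to take an explicit copy before re-keying. This is only a sketch: it assumes a data.table release that exports copy(), which may not match the version discussed here, and the helper name foo_safe is hypothetical.

```r
library(data.table)

x <- data.table(a = c(1, 1, 2), b = c(3, 4, 4), key = "a")

# Plain assignment does not copy: y and x refer to the same table,
# so re-keying y also re-keys x.
y <- x
setkey(y, b)
key_x_after_alias <- key(x)

# Assumption: copy() is available. It returns an independent table,
# so setkey() inside the function leaves the caller's table untouched.
x <- data.table(a = c(1, 1, 2), b = c(3, 4, 4), key = "a")
foo_safe <- function(dt) {
  dt <- copy(dt)   # work on a private copy
  setkey(dt, b)
  dt
}
z <- foo_safe(x)

key_x_after_copy <- key(x)
key_z <- key(z)
```

The cost is a full copy of the table, which gives up the memory savings that make in-place keying attractive, so this is a trade-off and not a fix for the design concern.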
_______________________________________________ datatable-help mailing list [email protected] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
