Jeremiah,

Thanks. Just a few hours ago, I answered a similar question in reply to a post from Ron (pasted below):
`data.table` is designed with *really large* data sets in mind (> 100 or 200 GB in memory even). Therefore, as a design feature, it trades "referential transparency" for manipulating data objects *as efficiently as possible* in terms of both *speed* and *memory usage* (most of the time they go hand in hand). This is perhaps the biggest design choice one needs to be aware of when choosing and working with data.tables.

It is possible to modify objects by reference using data.table. All the functions that begin with "set*" modify objects by reference. The only other function that does so, and does not begin with "set*", is the `:=` operator.

There's a pending feature request on adding this point (on explicit copy) to the FAQs, which we've not gotten to yet.

To our knowledge, people do overcome this difference quite quickly. It's not necessary to know about pointers to understand that the object gets modified in place. I'm not a Python user at all, but recently came to know that this is also a feature there: https://docs.python.org/2/library/copy.html

But point taken. The point that an explicit copy is required will be added to the FAQs.

Arun

From: jeremiah rounds [email protected]
Reply: jeremiah rounds [email protected]
Date: June 14, 2014 at 7:23:22 AM
To: [email protected] [email protected]
Subject: [datatable-help] Are you aware of this?

As a fan of your work I have always been curious whether you are aware of this. I find it causes new users to make mistakes.

    > dt = list()
    > dt$x = 1:10
    > dt$y = letters[10:1]
    > dt = as.data.table(as.data.frame(dt))
    > dt
         x y
     1:  1 j
     2:  2 i
     3:  3 h
     4:  4 g
     5:  5 f
     6:  6 e
     7:  7 d
     8:  8 c
     9:  9 b
    10: 10 a
    > x0 = dt$x
    > x1 = dt$x
    > x0[1] = 11
    > setkeyv(dt,"y")
    > x0
     [1] 11  2  3  4  5  6  7  8  9 10
    > x1
     [1] 10  9  8  7  6  5  4  3  2  1
    > x1 == x0
     [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

x0 and x1 are assigned at exactly the same time, and since R data.frames will not do this, it lures people into thinking they are identical and distinct copies, as they would be with data.frames.
My theory is that they are not actually copied: they are promised. When x0 has its index 1 changed, it induces a copy distinct from dt$x, but x1 has had no operation on it, so it still refers to dt$x through its promise. Setting the key on dt reorders it, and since x1 still hasn't been evaluated, it now matches the order of dt.

I found new users getting unpredictable results because they would try to use a data.table as a data.frame and induce this with sorts. If you think you copied something from dt in a particular order by assigning it ahead of the setkeyv, you make a mistake. You don't really expect x1, assigned maybe a page of code above, to have its order changed by a setkeyv. You do if you think about C pointers and references, but in R you really don't think that way. Many R users don't even know what a pointer is.

Thanks,
Jeremiah

    > sessionInfo()
    R version 3.0.1 (2013-05-16)
    Platform: x86_64-unknown-linux-gnu (64-bit)

    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
     [7] LC_PAPER=C                 LC_NAME=C
     [9] LC_ADDRESS=C               LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] splines   parallel  stats     graphics  grDevices utils     datasets
    [8] methods   base

    other attached packages:
    [1] locfit_1.5-9.1       edgeR_3.4.2          limma_3.18.13
    [4] data.table_1.9.2     GenomicRanges_1.14.4 XVector_0.2.0
    [7] IRanges_1.20.7       BiocGenerics_0.8.0

    loaded via a namespace (and not attached):
    [1] grid_3.0.1      lattice_0.20-15 plyr_1.8.1      Rcpp_0.11.1
    [5] reshape2_1.4    stats4_3.0.1    stringr_0.6.2   tools_3.0.1

_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
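The aliasing reported in the original message, and the explicit copy mentioned in the reply, can be sketched roughly as follows. This is a minimal illustration and not code from the thread: it assumes the data.table package is installed, the names `x_alias` and `x_copy` are made up for the example, and it uses data.table's `address()` and `copy()` to show that a plain assignment from `dt$x` shares memory with the column while `copy()` does not.

```r
# Minimal sketch; assumes the data.table package is installed.
library(data.table)

dt <- data.table(x = 1:10, y = letters[10:1])

x_alias <- dt$x        # no copy is made: same vector as the column inside dt
x_copy  <- copy(dt$x)  # explicit copy, independent of dt

# address() reports where an object lives in memory.
same_before <- address(x_alias) == address(dt$x)  # the alias shares memory
diff_before <- address(x_copy)  != address(dt$x)  # the copy does not

setkeyv(dt, "y")       # reorders dt in place, including the shared vector

print(x_alias)         # follows dt's new order
print(x_copy)          # keeps the order it had when it was copied
```

If a vector taken from a data.table has to keep its order after a later `setkey*` call, taking it with `copy()` as above is one way to make that explicit.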
