Hi Matt,

There was recently another discussion on using setkey on .SD here:
http://r.789695.n4.nabble.com/setkey-on-SD-td4690283.html

So the following code won't work any more in the current 1.9.3 dev
version. I think the idea of using setkey in a "chain" of data.tables
was nice, since it allows the key to be set temporarily. The basic idea
is taken from the comment here:
http://stackoverflow.com/questions/22863414/using-roll-true-with-allow-cartesian-true#comment34980343_22866917

    A <- data.table(x = c(1, 2, 3, 4, 5),
                    y = letters[1:5])
    B <- data.table(x = c(1, 2, 3, 1, 4),
                    f = c("Alice", "Alice", "Alice", "Bob", "Bob"),
                    z = 101:105)
    B[, setkey(.SD, x)][
      , .SD[A, roll = TRUE, rollends = FALSE], by = f][
      , setkey(.SD, x)]

Thanks,
M

On 06/18/2014 01:03 AM, Matt Dowle wrote:
> Hi Ron,
>
> Thanks for highlighting this. Two changes now in v1.9.3 on GitHub:
>
>   * |setkey| on |.SD| is now an error, rather than warnings for each
>     group about rebuilding the key. The new error is similar to when
>     attempting to use |:=| in a |.SD| subquery: |".SD is locked. Using
>     set*() functions on .SD is reserved for possible future use; a
>     tortuously flexible way to modify the original data by group."|
>     Thanks to Ron Hylton for highlighting the issue on datatable-help
>     here
>     <http://r.789695.n4.nabble.com/data-table-is-asking-for-help-tp4692080.html>.
>
>   * Looping calls to |unique(DT)| such as in
>     |DT[,unique(.SD),by=group]| is now faster by avoiding internal
>     overhead of calling |[.data.table|. Thanks again to Ron Hylton for
>     highlighting in the same thread
>     <http://r.789695.n4.nabble.com/data-table-is-asking-for-help-tp4692080.html>.
>     His example is reduced from 28 sec to 9 sec, with identical
>     results.
>
> I now get the following (on my slow netbook) with no changes to your
> code.
>
>     print(system.time(ut1 <- test[, conflictsTable1(.SD), by=id]))  # were warnings, now error
>     print(system.time(ut2 <- test[, conflictsTable2(.SD), by=id]))  # was 28s, now 9s
>     print(system.time(uf <- ddply(test, .(id), conflictsFrame)))    # 13s
>
> This just fixes the surprises, basically. Clearly Arun uses data.table
> in a better way which is orders of magnitude faster.
>
> Matt
>
>
> On 14/06/14 03:58, Ron Hylton wrote:
>> Thanks, that's very helpful.
>>
>> *From:* Arunkumar Srinivasan [mailto:[email protected]]
>> *Sent:* Friday, June 13, 2014 10:46 PM
>> *To:* Ron Hylton; [email protected]
>> *Subject:* Re: [datatable-help] data.table is asking for help
>>
>> Sorry. But we can simplify it even further:
>>
>> The first step is just |unique(test)|. So, we can do:
>>
>>     system.time({
>>       ans = unique(test)
>>       ans = ans[ans[, .I[.N > 1L], by=id]$V1]
>>     })
>>     # 0.016 0.000 0.016
>>
>> Identical?
>>
>>     setkey(ans)
>>     setkey(ut1)
>>     identical(ans, ut1) # [1] TRUE
>>
>> Arun
>>
>> From: Arunkumar Srinivasan [email protected]
>> Reply: Arunkumar Srinivasan [email protected]
>> Date: June 14, 2014 at 4:42:31 AM
>> To: Ron Hylton [email protected], [email protected]
>> Subject: Re: [datatable-help] data.table is asking for help
>>
>> A slightly simpler version of the 2nd solution is:
>>
>>     system.time({
>>       ans = test[, .N, by=names(test)]
>>       ans = ans[ans[, .I[.N > 1L], by=id]$V1]
>>     })
>>     # 0.019 0.000 0.019
>>
>> The answers are identical, you can check this by doing:
>>
>>     ans[, N := NULL]
>>     setkey(ans)
>>     setkey(ut1)
>>     identical(ans, ut1) # [1] TRUE
>>
>> Arun
>>
>> From: Arunkumar Srinivasan [email protected]
>> Reply: Arunkumar Srinivasan [email protected]
>> Date: June 14, 2014 at 4:34:15 AM
>> To: Ron Hylton [email protected], [email protected]
>> Subject: Re: [datatable-help] data.table is asking for help
>>
>> The j-expression is evaluated from within C for each group (unless
>> they’re optimised with GForce - a new initiative in data.table). And
>> |eval(.SD)| or |eval(anything(.SD))| is costly.
>>
>> You can get around it by listing the columns yourself and using |.I|
>> instead, as follows:
>>
>>     test[test[, .I[length(unique(list(x1,x2,x3))[[1L]]) > 1L], by=id]$V1]
>>     # 0.140 0.001 0.142
>>
>> Takes about 0.14 seconds.
>>
>> ------------------------------------------------------------------------
>>
>> An even faster way is:
>>
>>     system.time({
>>       ans = test[test[, .I[.N > 1], by=id]$V1]   # (1)
>>       ans = ans[, .N, by=names(ans)]             # (2)
>>       ans = ans[ans[, .I[.N > 1L], by=id]$V1]    # (3)
>>     })
>>     # 0.026 0.000 0.027
>>
>> The idea for the second case is:
>>
>> (1) Remove all entries where there’s just 1 row corresponding to that
>>     |id|.
>> (2) Aggregate this result by all the columns now and get the number
>>     of rows in the column |N| (we won’t have to use this column
>>     though).
>> (3) Now, if we aggregate by |id| and any id has just 1 row, then it’d
>>     mean that that |id| has had more than 1 row (step (1) filtering
>>     ensures this), but all of them are the same and we don’t need
>>     them. So we just filter for those where .N > 1L.
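To see the three steps above on something small enough to check by hand, here is a sketch on a made-up table. The ids and values are invented for illustration; only the lines marked (1) to (3) come from the message above.

```r
library(data.table)

# Made-up data: one id per interesting case.
#   a: two identical rows     -> no conflict
#   b: two different rows     -> conflict
#   c: a single row           -> no conflict
#   d: three rows, two values -> conflict
test <- data.table(
  id = c("a", "a", "b", "b", "c", "d", "d", "d"),
  x1 = c(1, 1, 1, 2, 5, 7, 7, 8)
)

ans <- test[test[, .I[.N > 1L], by = id]$V1]   # (1) drop ids with a single row
ans <- ans[, .N, by = names(ans)]              # (2) collapse duplicate rows
ans <- ans[ans[, .I[.N > 1L], by = id]$V1]     # (3) keep ids with > 1 distinct row
ans[, N := NULL]                               # the count column is not needed

print(ans)
```

Ids "a" and "c" drop out; "b" and "d" remain with their distinct rows.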
>>
>> HTH
>>
>> Arun
>>
>> From: Ron Hylton [email protected]
>> Reply: Ron Hylton [email protected]
>> Date: June 14, 2014 at 3:30:55 AM
>> To: [email protected]
>> Subject: Re: [datatable-help] data.table is asking for help
>>
>> The performance is what puzzles me; the results are correct so the
>> warnings don’t matter, and not all the variations I’ve tried have
>> warnings. On the real dataset (~800,000 rows) datatable takes about
>> 1.5 times longer than dataframe + ddply. I expected it to be
>> substantially faster.
>>
>> *From:* Arunkumar Srinivasan [mailto:[email protected]]
>> *Sent:* Friday, June 13, 2014 8:57 PM
>> *To:* Ron Hylton; [email protected]
>> *Subject:* Re: [datatable-help] data.table is asking for help
>>
>> > However there’s another aspect. While I’m relatively new to R my
>> > understanding is that a function argument should be modifiable
>> > within the function body without affecting the caller, which
>> > perhaps conflicts with the behavior of .SD.
>>
>> `data.table` is designed with *really large* data sets in mind
>> (> 100 or 200 GB in memory even). And therefore, as a design feature,
>> it trades in "referential transparency" for manipulating data objects
>> *as efficiently as possible* in terms of both *speed* and *memory
>> usage* (most of the time they go hand-in-hand).
>>
>> This is perhaps the biggest design choice one needs to be aware of
>> when working with or choosing data.tables. It is possible to modify
>> objects by reference using data.table - all the functions that begin
>> with "set*" modify objects by reference. The only other non-"set*"
>> function that does so is the `:=` operator.
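The by-reference point is easy to check on a toy table. A minimal sketch, with a hypothetical helper `sort_by_id()` and made-up data; `copy()` is the data.table function for taking an independent copy first.

```r
library(data.table)

# Made-up data, deliberately unsorted.
DT <- data.table(id = c("b", "a", "c"), v = 1:3)

# Hypothetical helper: it keys its argument and returns nothing useful.
sort_by_id <- function(d) {
  setkey(d, id)      # set*() works by reference, so d is not copied
  invisible(NULL)
}

sort_by_id(DT)
print(key(DT))       # the caller's table is now keyed by id
print(DT$id)         # and its rows have been reordered in place

# To leave the caller's table alone, work on an explicit copy.
snapshot <- copy(DT)
snapshot[, v := v * 10L]   # := also works by reference, but only on snapshot
print(DT$v)                # DT keeps its own values
```

So a function that calls a set*() function or `:=` on its argument does affect the caller, unlike ordinary R assignment inside a function body.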
>>
>> HTH
>>
>> Arun
>>
>> From: Ron Hylton [email protected]
>> Reply: Ron Hylton [email protected]
>> Date: June 14, 2014 at 2:52:04 AM
>> To: [email protected]
>> Subject: Re: [datatable-help] data.table is asking for help
>>
>> I suspected it was something like this. As one clarification, there
>> is a setkey(test,id) before any setkey(.SD). If setkey(test,id) is
>> changed to setkey(test) so all columns are in the original datatable
>> key then the warning goes away.
>>
>> However there’s another aspect. While I’m relatively new to R my
>> understanding is that a function argument should be modifiable within
>> the function body without affecting the caller, which perhaps
>> conflicts with the behavior of .SD.
>>
>> *From:* Arunkumar Srinivasan [mailto:[email protected]]
>> *Sent:* Friday, June 13, 2014 8:23 PM
>> *To:* Ron Hylton; [email protected]
>> *Subject:* Re: [datatable-help] data.table is asking for help
>>
>> Nicely reproducible post. Reproducible in v1.9.3 (latest commit) as
>> well.
>>
>> This is a tricky one. It happens because you’re setting key on |.SD|
>> which should normally not be allowed. What happens is, when you set
>> key the first time, there’s no key set (here) and therefore key is
>> set on all the columns |x1|, |x2| and |x3|.
>>
>> Now, when the next group (in the |by=.|) is passed to your function,
>> it’ll have the |key| already set to |x1,x2,x3| (because |setkey|
>> modifies the object by reference), but |.SD| has obtained *new* data
>> corresponding to /this/ group. And |data.table| sorts this data,
>> knowing that it already has key set.. but if the key is set then the
>> order must be 1:n. But it wouldn’t be, as this data isn’t sorted.
>> |data.table| warns in those scenarios..
>> and that’s why you get the warning.
>>
>> To verify this, you can try:
>>
>>     conflictsTable1 <- function(f, address) {
>>       u <- unique(setkey(f))
>>       setattr(f, 'sorted', NULL)
>>       if (nrow(u) == 1) return(NULL)
>>       u
>>     }
>>
>> Basically, we set the key of |f| (which is equal to |.SD| as it’s
>> only modified by reference) to |NULL| every time after.. so that
>> |.SD| for the new group will not have the key set.
>>
>> The ideal scenario here, IIUC, is that |setkey(.SD)| or things
>> pointing to |.SD| should not be possible (locking binding doesn’t
>> seem to affect things done by reference..). |.SD| however should
>> retain the key of the data.table, if a key was set, wherever
>> possible.
>>
>> Arun
>>
>> From: Ron Hylton [email protected]
>> Reply: Ron Hylton [email protected]
>> Date: June 14, 2014 at 1:55:53 AM
>> To: [email protected]
>> Subject: [datatable-help] data.table is asking for help
>>
>> The code below generates the warning:
>>
>>     In setkeyv(x, cols, verbose = verbose) :
>>       Already keyed by this key but had invalid row order, key
>>       rebuilt. If you didn't go under the hood please let
>>       datatable-help know so the root cause can be fixed.
>>
>> This is my first attempt at using datatable so I probably did
>> something dumb, but maybe that’s useful for someone. The first case
>> is the one that gives the warnings.
>>
>> I’m also surprised at the timings. I wrote the original algorithm
>> using dataframe & ddply and I expected datatable to be substantially
>> faster; the opposite is true.
>>
>> The algorithm does the following: Certain columns in the table are
>> keys and others are values in the sense that each row with the same
>> set of keys should have the same set of values.
>> Find all the key sets for which this is not true and return the key
>> sets + conflicting value sets.
>>
>> Insight into the performance would be appreciated.
>>
>> Regards,
>> Ron
>>
>>     library(data.table)
>>     library(plyr)
>>
>>     conflictsTable1 <- function(f) {
>>       u <- unique(setkey(f))
>>       if (nrow(u) == 1) return(NULL)
>>       u
>>     }
>>
>>     conflictsTable2 <- function(f) {
>>       u <- unique(f)
>>       if (nrow(u) == 1) return(NULL)
>>       u
>>     }
>>
>>     conflictsFrame <- function(f) {
>>       u <- unique(f)
>>       if (nrow(u) == 1) return(NULL)
>>       u
>>     }
>>
>>     N <- 10000
>>     test <- data.table(id=as.character(10000*sample(1:N,N,replace=TRUE)),
>>                        x1=rnorm(N), x2=rnorm(N), x3=rnorm(N))
>>
>>     setkey(test,id)
>>
>>     print(system.time(ut1 <- test[, conflictsTable1(.SD), by=id]))
>>     print(system.time(ut2 <- test[, conflictsTable2(.SD), by=id]))
>>     print(system.time(uf <- ddply(test, .(id), conflictsFrame)))
>>
>> _______________________________________________
>> datatable-help mailing list
>> [email protected]
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
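One possible way to keep the chained idea from the top of this message without calling setkey on .SD is to name the join column with `on=`. This is only a sketch, not something proposed in the thread: it assumes a data.table release that has the `on=` argument (as far as I know it is not in the 1.9.3 dev version discussed here), and `res` is just a name picked for the example.

```r
library(data.table)

A <- data.table(x = c(1, 2, 3, 4, 5), y = letters[1:5])
B <- data.table(x = c(1, 2, 3, 1, 4),
                f = c("Alice", "Alice", "Alice", "Bob", "Bob"),
                z = 101:105)

# Rolling join of each group's rows to A. The join column is named with
# on=, so no key has to be set on .SD.
res <- B[, .SD[A, on = "x", roll = TRUE, rollends = FALSE], by = f]

# Key the final result. It is an ordinary data.table, not .SD, so
# setkey is allowed here.
setkey(res, x)

print(res)
```

Each group is joined to all five rows of A, so `res` has ten rows. Bob's row at x = 2 takes the z value rolled forward from x = 1, and rows past a group's last x stay NA because of rollends = FALSE.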
