A slightly simpler version of the 2nd solution is:

system.time({
ans = test[, .N, by=names(test)]
ans = ans[ans[, .I[.N > 1L], by=id]$V1]
})
#  0.019   0.000   0.019

The answers are identical; you can check this by running:

ans[, N := NULL]
setkey(ans)
setkey(ut1)
identical(ans, ut1) # [1] TRUE

Arun

From: Arunkumar Srinivasan [email protected]
Reply: Arunkumar Srinivasan [email protected]
Date: June 14, 2014 at 4:34:15 AM
To: Ron Hylton [email protected], [email protected] 
[email protected]
Subject:  Re: [datatable-help] data.table is asking for help  

The j-expression is evaluated from within C for each group (unless it’s 
optimised with GForce, a new initiative in data.table), and eval(.SD) or 
eval(anything(.SD)) is costly.

You can get around it by listing the columns by yourself and using .I instead, 
as follows:

test[test[, .I[length(unique(list(x1,x2,x3))[[1L]]) > 1L], by=id]$V1]
#  0.140   0.001   0.142   

Takes about 0.14 seconds.

An even faster way is:

system.time({
ans = test[test[, .I[.N > 1], by=id]$V1]        # (1)   
ans = ans[, .N, by=names(ans)]                  # (2)   
ans = ans[ans[, .I[.N > 1L], by=id]$V1]         # (3)
})

#  0.026   0.000   0.027   

The idea behind this second approach is:

(1) Remove all entries where there’s just 1 row corresponding to that id.
(2) Aggregate the result by all the columns and store the number of rows per 
group in the column N (we won’t need this column afterwards).
(3) Now aggregate by id. If an id has just 1 row at this point, it had more 
than 1 row originally (the filtering in step (1) ensures this), but all of 
them were identical, so we don’t need it. We therefore keep only the ids where 
.N > 1L.
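
To make the three steps concrete, here is a minimal sketch on a toy table. The table `toy` and the intermediate names `s1`, `s2`, `s3` are made up for illustration and are not part of the original post; only the three expressions mirror the code above.

```r
library(data.table)

# Hypothetical toy data: id "a" has two different value rows,
# id "b" has two identical rows, id "c" has a single row.
toy <- data.table(id = c("a", "a", "b", "b", "c"),
                  x1 = c(1, 2, 3, 3, 4),
                  x2 = c(1, 1, 5, 5, 6))

s1 <- toy[toy[, .I[.N > 1L], by = id]$V1]   # (1) drops id "c" (only 1 row)
s2 <- s1[, .N, by = names(s1)]              # (2) one row per distinct (id, x1, x2)
s3 <- s2[s2[, .I[.N > 1L], by = id]$V1]     # (3) drops id "b" (rows were identical)
s3[, N := NULL]                             # the count column is no longer needed
print(s3)
```

Only the two rows for id "a" should remain, since that is the only id whose rows disagree.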

HTH


Arun

From: Ron Hylton [email protected]
Reply: Ron Hylton [email protected]
Date: June 14, 2014 at 3:30:55 AM
To: [email protected] 
[email protected]
Subject:  Re: [datatable-help] data.table is asking for help

The performance is what puzzles me; the results are correct so the warnings 
don’t matter, and not all the variations I’ve tried have warnings.  On the real 
dataset (~800,000 rows) datatable takes about 1.5 times longer than dataframe + 
ddply.  I expected it to be substantially faster.

 

From: Arunkumar Srinivasan [mailto:[email protected]]
Sent: Friday, June 13, 2014 8:57 PM
To: Ron Hylton; [email protected]
Subject: Re: [datatable-help] data.table is asking for help

 

However there’s another aspect.  While I’m relatively new to R my understanding 
is that a function argument should be modifiable within the function body 
without affecting the caller, which perhaps conflicts with the behavior of .SD.

`data.table` is designed with *really large* data sets in mind (> 100 or 200 
GB in memory even). Therefore, as a design feature, it trades in "referential 
transparency" for manipulating data objects *as efficiently as possible* in 
terms of both *speed* and *memory usage* (most of the time they go 
hand-in-hand).

This is perhaps the biggest design choice one needs to be aware of when 
choosing and working with data.tables. It is possible to modify objects by 
reference using data.table: all the functions that begin with "set*" modify 
objects by reference. The only other one, which does not begin with "set", is 
the `:=` operator.
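
As a small illustration of what this means for function arguments, consider the sketch below. The names `DT`, `add_col` and `add_col_copy` are hypothetical; `copy()` is data.table's function for taking an explicit deep copy when the caller's object must stay untouched.

```r
library(data.table)

DT <- data.table(a = 1:3)

# `:=` adds the column by reference, so the caller's DT changes as well.
add_col <- function(x) x[, b := a * 2L]
add_col(DT)
names(DT)

# Taking an explicit copy first keeps the caller's object untouched.
add_col_copy <- function(x) {
  x <- copy(x)
  x[, d := a + 1L]
  x
}
DT2 <- add_col_copy(DT)
names(DT)    # DT does not gain column d
names(DT2)   # the copy has a, b and d
```

The copy costs time and memory proportional to the size of the table, which is exactly the overhead that modifying by reference is meant to avoid.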

 

HTH

Arun


From: Ron Hylton [email protected]
Reply: Ron Hylton [email protected]
Date: June 14, 2014 at 2:52:04 AM
To: [email protected] 
[email protected]
Subject:  Re: [datatable-help] data.table is asking for help




I suspected it was something like this.  As one clarification, there is a 
setkey(test,id) before any setkey(.SD).   If setkey(test,id) is changed to 
setkey(test) so all columns are in the original datatable key then the warning 
goes away.

 

However there’s another aspect.  While I’m relatively new to R my understanding 
is that a function argument should be modifiable within the function body 
without affecting the caller, which perhaps conflicts with the behavior of .SD.

 

From: Arunkumar Srinivasan [mailto:[email protected]]
Sent: Friday, June 13, 2014 8:23 PM
To: Ron Hylton; [email protected]
Subject: Re: [datatable-help] data.table is asking for help

 

Nicely reproducible post. Reproducible in v1.9.3 (latest commit) as well.

This is a tricky one. It happens because you’re setting a key on .SD, which 
should normally not be allowed. When you set the key the first time, no key is 
set yet (here), and therefore the key is set on all the columns x1, x2 and x3.

When the next group (in the by=) is passed to your function, .SD still has the 
key set to x1,x2,x3 (because setkey modifies the object by reference), but it 
now holds the new data corresponding to this group. When data.table sorts this 
data, it sees that the key is already set, in which case the row order must 
already be 1:n. But it isn’t, as this data isn’t sorted. data.table warns in 
such scenarios, and that’s why you get the warning.

To verify this, you can try:

conflictsTable1 <- function(f) {
  u <- unique(setkey(f))       # setkey keys f (and hence .SD) by reference
  setattr(f, 'sorted', NULL)   # remove the key again before the next group
  if (nrow(u) == 1) return(NULL)
  u
}

Basically, we reset the key of f (which is the same object as .SD, since it’s 
only modified by reference) to NULL every time, so that .SD for the next group 
will not have the key set.

The ideal scenario here, IIUC, is that setkey(.SD) or things pointing to .SD 
should not be possible (locking binding doesn’t seem to affect things done by 
reference..). .SD however should retain the key of the data.table, if a key was 
set, wherever possible.
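
Another way to sidestep the problem, sketched here as an assumption and not something tested against the original data, is to set the key on a private copy of each group so that .SD itself is never touched. The function name `conflictsTableCopy` and the table `toy` are made up for illustration.

```r
library(data.table)

conflictsTableCopy <- function(f) {
  u <- unique(setkey(copy(f)))   # the key is set on the copy, not on .SD
  if (nrow(u) == 1L) return(NULL)
  u
}

# Hypothetical toy data: only id "a" has conflicting values.
toy <- data.table(id = c("a", "a", "b", "b", "c"),
                  x1 = c(1, 2, 3, 3, 4),
                  x2 = c(1, 1, 5, 5, 6))
setkey(toy, id)

res <- toy[, conflictsTableCopy(.SD), by = id]
print(res)
```

Copying every group adds overhead, so this is about correctness only and is likely to be slower than approaches that avoid calling a function on .SD altogether.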

 

Arun


From: Ron Hylton [email protected]
Reply: Ron Hylton [email protected]
Date: June 14, 2014 at 1:55:53 AM
To: [email protected] 
[email protected]
Subject:  [datatable-help] data.table is asking for help

 

The code below generates the warning:

 

In setkeyv(x, cols, verbose = verbose) :

  Already keyed by this key but had invalid row order, key rebuilt. If you 
didn't go under the hood please let datatable-help know so the root cause can 
be fixed.

 

This is my first attempt at using datatable so I probably did something dumb, 
but maybe that’s useful for someone.  The first case is the one that gives the 
warnings.

 

I’m also surprised at the timings.  I wrote the original algorithm using 
dataframe & ddply and I expected datatable to be substantially faster; the 
opposite is true.

 

The algorithm does the following:  Certain columns in the table are keys and 
others are values, in the sense that each row with the same set of keys should 
have the same set of values.  Find all the key sets for which this is not true 
and return the key sets + conflicting value sets.

 

Insight into the performance would be appreciated.

 

Regards,

Ron

 

library(data.table)
library(plyr)

conflictsTable1 <- function(f) {
  u <- unique(setkey(f))
  if (nrow(u) == 1) return(NULL)
  u
}

conflictsTable2 <- function(f) {
  u <- unique(f)
  if (nrow(u) == 1) return(NULL)
  u
}

conflictsFrame <- function(f) {
  u <- unique(f)
  if (nrow(u) == 1) return(NULL)
  u
}

N <- 10000
test <- data.table(id=as.character(10000*sample(1:N,N,replace=TRUE)),
                   x1=rnorm(N), x2=rnorm(N), x3=rnorm(N))

setkey(test,id)

print(system.time(ut1 <- test[, conflictsTable1(.SD), by=id]))
print(system.time(ut2 <- test[, conflictsTable2(.SD), by=id]))
print(system.time(uf <- ddply(test, .(id), conflictsFrame)))

_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
