Thanks to the bug report below and S.O. question, 'allow.cartesian' is
now in 1.8.7.
Please shout if anyone spots any issues with this.
=====
New argument 'allow.cartesian' (default FALSE) added to X[Y] and
merge(X,Y), #2464. Prevents large allocations due to misspecified
joins; e.g., duplicate key values in Y joining to the same group in X
over and over again. The word 'cartesian' is used loosely for when
more than max(nrow(X),nrow(Y)) rows would be returned. The error
message is verbose and includes advice. Thanks to a question by
Nick Clark :
http://stackoverflow.com/questions/14231737/greatest-n-per-group-reference-with-intervals-in-r-or-sql
help from user1935457 and a detailed reproducible crash report from JR.
If the new option affects existing code you can set :
options(datatable.allow.cartesian=TRUE)
to restore the previous behaviour until you have time to address it.
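As a minimal sketch of how the new argument behaves (the tiny tables and
the column names k, x and y are made up for illustration, and the exact
row threshold and error wording may differ between data.table versions):

```r
library(data.table)

# Illustrative tables: every row shares the key value "a", so each row of Y
# joins to all three rows of X.
X <- data.table(k = c("a", "a", "a"), x = 1:3, key = "k")
Y <- data.table(k = c("a", "a", "a"), y = 1:3, key = "k")

# Default (allow.cartesian = FALSE): the join would return 3 * 3 = 9 rows,
# more than either table has, so it is expected to stop with an error.
err <- tryCatch(X[Y], error = function(e) e)
if (inherits(err, "error")) message(conditionMessage(err))

# Opting in explicitly returns the full result.
res <- X[Y, allow.cartesian = TRUE]
print(nrow(res))

# merge() accepts the same argument.
res2 <- merge(X, Y, by = "k", allow.cartesian = TRUE)
print(nrow(res2))
```

Passing the argument per call keeps the opt-in local to the join that
really is meant to be large, whereas the options() setting above switches
the check off everywhere.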
=====
On 10.01.2013 11:33, Matthew Dowle wrote:
Hi,
Fantastic. Thanks so much for this - same for me, yes.
It's similar to a huge cartesian join where the result
would have more than 2^31 rows. data.table should
be trapping that gracefully and giving an error
like this:
"i's key is non unique; i.e., each duplicated key value
of i will join to the same group in x over and over.
The result will be huge. Are you sure?"
Filed as bug here :
https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2464&group_id=240&atid=975
Will make it a graceful error, if I understood correctly?
Thanks!
Matthew
On 10.01.2013 10:37, J R wrote:
While investigating the following SO question
http://stackoverflow.com/questions/14231737/greatest-n-per-group-reference-with-intervals-in-r-or-sql
the asker ran into a segfault during a merge.
I tried to reproduce it based on his description of his data (a 4
million row table and a 1 million row table, merging on two columns,
one with 20-some unique strings and one with "+" or "-").
The following setup code:
library(data.table)
set.seed(456)
X <- data.table(chr    = sample(LETTERS, 4e6, replace=TRUE),
                strand = sample(c("+","-"), 4e6, replace=TRUE),
                tags   = as.integer(runif(4e6) * 100),
                start  = as.integer(runif(4e6) * 60000),
                end    = as.integer(runif(4e6) * 60000))
Y <- data.table(chr    = sample(LETTERS, 1e6, replace=TRUE),
                strand = sample(c("+","-"), 1e6, replace=TRUE),
                tags   = as.integer(runif(1e6) * 5),
                start  = as.integer(runif(1e6) * 60000),
                end    = as.integer(runif(1e6) * 60000))
setkey(X, chr, strand)
setkey(Y, chr, strand)
Gives the following errors:
merge(X,Y)
Error in vecseq(f__, len__) : negative length vectors are not allowed
Y[X]
Error in vecseq(f__, len__) : negative length vectors are not allowed
This is with data.table 1.8.7 on Windows x64. Poking around with
debug(data.table:::`[.data.table`) makes it seem like sum(len__) >
.Machine$integer.max after the binary merge, so the errors above
might come from these lines in vecseq.c:
for (i=0; i<LENGTH(len); i++) reslen += INTEGER(len)[i];
ans = PROTECT(allocVector(INTSXP, reslen));
Does that mean a dataset of this size and structure is bumping up
against R's vector size limits for this type of merge?
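One way to check this without running the join is to count rows per key
value in each table and sum the products. The sketch below assumes that
the number of matched rows is the sum over shared key values of
(rows in X) * (rows in Y); estimate_join_rows is a hypothetical helper
written for this illustration, not part of data.table, and the small
tables stand in for the 4 million and 1 million row ones above.

```r
library(data.table)

# Hypothetical helper: number of matched rows a join of X and Y on `cols`
# would produce, computed from per-group counts only.
estimate_join_rows <- function(X, Y, cols) {
  nx <- X[, .(nx = .N), by = cols]     # rows per key value in X
  ny <- Y[, .(ny = .N), by = cols]     # rows per key value in Y
  counts <- merge(nx, ny, by = cols)   # key values present in both tables
  # Multiply as doubles so the total cannot overflow R's 32-bit integers.
  sum(as.numeric(counts$nx) * as.numeric(counts$ny))
}

# Small stand-in tables with the same key structure as in the report.
X <- data.table(chr = c("A", "A", "B"),      strand = c("+", "+", "-"))
Y <- data.table(chr = c("A", "A", "A", "C"), strand = c("+", "+", "+", "-"))

est <- estimate_join_rows(X, Y, c("chr", "strand"))
print(est)
print(est > .Machine$integer.max)   # would the row index overflow?

# Back-of-envelope for the reported sizes: 26 * 2 = 52 roughly equal groups,
# so about 4e6 * 1e6 / 52 matched rows, on the order of 7.7e10.
rough <- 4e6 * 1e6 / 52
print(rough > .Machine$integer.max)
```

If the total accumulated in the C loop above is a 32-bit int (an
assumption, the declaration is not shown here), a sum of that size would
wrap around, which would be consistent with the "negative length vectors"
message.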
_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help