I have one little nitpick that may be important as you write documentation. In 796, the global option doesn't have the second period in it:

$datatable.allowcartesian
[1] FALSE
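A quick way to check which spelling a given installation actually registers (just a sketch, to be run after loading the package):

    library(data.table)
    grep("cartesian", names(options()), value = TRUE)
    # here: "datatable.allowcartesian", vs. the two-period
    # "datatable.allow.cartesian" used in the NEWS entry below

Setting one spelling at a time and re-running a join that previously errored would confirm which name X[Y] actually reads.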
On Tue, Jan 15, 2013 at 3:00 PM, Matthew Dowle <[email protected]> wrote:
>
> Thanks to the bug report below and S.O. question, 'allow.cartesian' is
> now in 1.8.7. Please shout if anyone spots any issues with this.
>
> =====
> New argument 'allow.cartesian' (default FALSE) added to X[Y] and
> merge(X,Y), #2464. Prevents large allocations due to misspecified
> joins; e.g., duplicate key values in Y joining to the same group in X
> over and over again. The word 'cartesian' is used loosely for when
> more than max(nrow(X),nrow(Y)) rows would be returned. The error
> message is verbose and includes advice. Thanks to a question by
> Nick Clark:
>
> http://stackoverflow.com/questions/14231737/greatest-n-per-group-reference-with-intervals-in-r-or-sql
>
> help from user1935457, and a detailed reproducible crash report from JR.
> If the new option affects existing code you can set:
>
>     options(datatable.allow.cartesian=TRUE)
>
> to restore the previous behaviour until you have time to address it.
> =====
>
> On 10.01.2013 11:33, Matthew Dowle wrote:
>>
>> Hi,
>>
>> Fantastic. Thanks so much for this - same for me, yes.
>>
>> It's similar to a huge cartesian join where the result would have
>> more than 2^31 rows. data.table should be trapping that gracefully
>> and giving an error like this:
>>
>> "i's key is non unique; i.e., each duplicated key value of i will
>> join to the same group in x over and over. The result will be huge.
>> Are you sure?"
>>
>> Filed as a bug here:
>>
>> https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2464&group_id=240&atid=975
>>
>> Will make it a graceful error, if I understood correctly?
>>
>> Thanks!
>> Matthew
>>
>> On 10.01.2013 10:37, J R wrote:
>>>
>>> While investigating the following SO question
>>>
>>> http://stackoverflow.com/questions/14231737/greatest-n-per-group-reference-with-intervals-in-r-or-sql
>>>
>>> the asker ran into a segfault during a merge.
>>>
>>> I tried to reproduce it based on his description of his data (a
>>> 4 million row table and a 1 million row table, merging on two
>>> columns, one with 20-some unique strings and one with "+" or "-").
>>>
>>> The following setup code:
>>>
>>> set.seed(456)
>>> X <- data.table(chr = sample(LETTERS, 4e6, replace=TRUE),
>>>                 strand = sample(c("+","-"), 4e6, replace=TRUE),
>>>                 tags = as.integer(runif(4e6) * 100),
>>>                 start = as.integer(runif(4e6) * 60000),
>>>                 end = as.integer(runif(4e6) * 60000))
>>> Y <- data.table(chr = sample(LETTERS, 1e6, replace=TRUE),
>>>                 strand = sample(c("+","-"), 1e6, replace=TRUE),
>>>                 tags = as.integer(runif(1e6) * 5),
>>>                 start = as.integer(runif(1e6) * 60000),
>>>                 end = as.integer(runif(1e6) * 60000))
>>> setkey(X, chr, strand)
>>> setkey(Y, chr, strand)
>>>
>>> gives the following errors:
>>>
>>>> merge(X,Y)
>>> Error in vecseq(f__, len__) : negative length vectors are not allowed
>>>> Y[X]
>>> Error in vecseq(f__, len__) : negative length vectors are not allowed
>>>
>>> in data.table 1.8.7 on Windows x64. Some poking around with
>>> debug(data.table:::`[.data.table`) makes it seem that sum(len__) >
>>> .Machine$integer.max after the binary merge, which suggests the
>>> above errors come from these lines in vecseq.c:
>>>
>>> for (i=0; i<LENGTH(len); i++) reslen += INTEGER(len)[i];
>>> ans = PROTECT(allocVector(INTSXP, reslen));
>>>
>>> Does that mean a dataset of this size and structure is bumping up
>>> against R's vector size limits for this type of merge?
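For anyone reproducing the new behaviour, here is a minimal self-contained illustration (a sketch against 1.8.7 or later; the toy tables and column names are illustrative, not from the thread), plus the back-of-envelope arithmetic for why the original repro overflows the int counter in vecseq.c:

    library(data.table)  # 1.8.7+, where allow.cartesian exists

    # Toy tables: the duplicate key value "a" in Y joins to the same
    # two-row group in X over and over.
    X <- data.table(id = c("a","a","b"), x = 1:3)
    Y <- data.table(id = c("a","a","a","b"), y = 1:4)
    setkey(X, id)
    setkey(Y, id)

    try(X[Y])                     # errors: 7 rows > max(nrow(X),nrow(Y)) = 4
    X[Y, allow.cartesian = TRUE]  # the 7-row result, returned only on request

    # Why the original repro overflows: 26 chr values x 2 strands = 52
    # key groups, so each of Y's 1e6 rows matches roughly 4e6/52 rows
    # in X, and reslen (a C int in vecseq.c) would need about
    1e6 * (4e6 / 52)              # ~7.7e10 rows, far beyond
    .Machine$integer.max          # 2147483647, hence the negative length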
_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
