Re: [datatable-help] Advance warning

Matthew Dowle Wed, 16 Jan 2013 16:58:40 -0800

It was pretty clear so I just changed it (797) to: datatable.allow.cartesian.

So where an option corresponds to an argument name, the option is "datatable.<exact.arg.name>".


On 17.01.2013 00:21, Matthew Dowle wrote:

Thanks. Commit 796 corrected that typo in NEWS just after I sent the
email below.  So the global option was intended to be
$datatable.allowcartesian as coded and documented. I grep'd code and
man pages just in case and seems ok.

Rightly or wrongly, for the global options we used "datatable" (the dot
in data.table dropped), followed by a dot, followed by the argument/option
name with its dots dropped. That's consistent across the 9 global options
..... oh no, it isn't .. other than print.nrows and print.topn.  Darn it,
those slipped through.

All the options :

datatable.verbose            = FALSE
datatable.dfdispatchwarn     = TRUE
datatable.alloccol           = quote(max(100,2*ncol(DT)))
datatable.nomatch            = NA_integer_
datatable.optimize           = Inf
datatable.print.nrows        = 100L
datatable.print.topn         = 5L
datatable.warnredundantby    = TRUE
datatable.allowcartesian     = FALSE

Could we get away with dropping the 2nd dots in print.nrows and
print.topn I wonder?

Or, it could be "datatable.allow.cartesian" if that's what most
people would expect then?  warnredundantby and dfdispatchwarn aren't
argument names to functions, and datatable.alloccol actually
corresponds to n of alloc.col.

So datatable.allow.cartesian then?



On 16.01.2013 19:57, J R wrote:

I have one little nitpick that may be important as you write
documentation.  In 796, the global option doesn't have the second
period in it:

$datatable.allowcartesian
[1] FALSE


On Tue, Jan 15, 2013 at 3:00 PM, Matthew Dowle
<[email protected]> wrote:

Thanks to the bug report below and S.O. question, 'allow.cartesian' is now
in 1.8.7.
Please shout if anyone spots any issues with this.

=====

New argument 'allow.cartesian' (default FALSE) added to X[Y] and merge(X,Y),
#2464. Prevents large allocations due to misspecified joins; e.g., duplicate
key values in Y joining to the same group in X over and over again. The word
'cartesian' is used loosely for when more than max(nrow(X),nrow(Y)) rows
would be returned. The error message is verbose and includes advice. Thanks
to a question by Nick Clark :


http://stackoverflow.com/questions/14231737/greatest-n-per-group-reference-with-intervals-in-r-or-sql

help from user1935457 and a detailed reproducible crash report from JR.

If the new option affects existing code you can set :
  options(datatable.allow.cartesian=TRUE)
to restore the previous behaviour until you have time to address.
=====
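
To make the loose 'cartesian' rule in the NEWS item concrete: a join counts as cartesian when it would return more than max(nrow(X),nrow(Y)) rows. Below is a minimal sketch in C, assuming the number of matches for each join row is already available in an array. The function name join_looks_cartesian and its arguments are made up for illustration; this is not data.table's actual implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical check, not data.table's internals.
   len[i] is the (non-negative) number of rows matched by the i-th join row;
   nx and ny are the row counts of the two tables.
   Returns 1 when the join would return more than max(nx, ny) rows. */
int join_looks_cartesian(const int *len, size_t n, int64_t nx, int64_t ny)
{
    assert(len != NULL || n == 0);
    int64_t limit = nx > ny ? nx : ny;
    int64_t total = 0;                 /* 64-bit so the running sum cannot wrap */
    for (size_t i = 0; i < n; i++) {
        total += len[i];
        if (total > limit) return 1;   /* stop early, the rest is not needed */
    }
    return 0;
}
```

With unique keys every join row matches at most one row, so the total never exceeds the limit. Duplicated key values on both sides multiply the matches per row, which is what trips the check.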



On 10.01.2013 11:33, Matthew Dowle wrote:


Hi,

Fantastic. Thanks so much for this - same for me, yes.

It's similar to a huge cartesian join where the result
would have more than 2^31 rows. data.table should
be trapping that gracefully and giving an error
like this:

"i's key is non unique; i.e., each duplicated key value
of i will join to the same group in x over and over.
The result will be huge. Are you sure?"

Filed as bug here :

https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2464&group_id=240&atid=975

Will make it a graceful error, if I understood correctly?

Thanks!
Matthew


On 10.01.2013 10:37, J R wrote:


While investigating the following SO question

http://stackoverflow.com/questions/14231737/greatest-n-per-group-reference-with-intervals-in-r-or-sql

the asker ran into a segfault during a merge.

I tried to reproduce it based on his description of his data (a 4
million row table and a 1 million row table, merging on two columns,
one with 20-some unique strings and one with "+" or "-").

The following setup code:

library(data.table)
set.seed(456)

X <- data.table(chr    = sample(LETTERS, 4e6, replace=TRUE),
                strand = sample(c("+","-"), 4e6, replace=TRUE),
                tags   = as.integer(runif(4e6) * 100),
                start  = as.integer(runif(4e6) * 60000),
                end    = as.integer(runif(4e6) * 60000))

Y <- data.table(chr    = sample(LETTERS, 1e6, replace=TRUE),
                strand = sample(c("+","-"), 1e6, replace=TRUE),
                tags   = as.integer(runif(1e6) * 5),
                start  = as.integer(runif(1e6) * 60000),
                end    = as.integer(runif(1e6) * 60000))

setkey(X, chr, strand)
setkey(Y, chr, strand)

Gives the following errors:

merge(X,Y)
Error in vecseq(f__, len__) : negative length vectors are not allowed

Y[X]
Error in vecseq(f__, len__) : negative length vectors are not allowed


This is in data.table 1.8.7 on Windows x64.  Poking around with
debug(data.table:::`[.data.table`) makes it seem like sum(len__) >
.Machine$integer.max after the binary merge, so the above errors
might come from these lines in vecseq.c:

for (i=0; i<LENGTH(len); i++) reslen += INTEGER(len)[i];
ans = PROTECT(allocVector(INTSXP, reslen));

Does that mean a dataset of this size and structure is bumping up
against R's vector size limits for this type of merge?
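
To illustrate the suspected overflow: if the total is accumulated in an int, as the two quoted lines suggest, a sum past INT_MAX can wrap to a negative value, which would fit the "negative length vectors" message. Below is a minimal sketch in plain C of a guarded sum, without the R API. The helper name sum_lengths_checked and its return convention are assumptions for illustration, not the fix that went into data.table.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not data.table's actual code.
   Sums n group lengths in a 64-bit accumulator so the total cannot wrap.
   Returns 0 and stores the total in *out when it fits in an int;
   returns -1 when an entry is negative or the total would exceed INT_MAX. */
int sum_lengths_checked(const int *len, size_t n, int *out)
{
    assert(out != NULL && (len != NULL || n == 0));
    int64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        if (len[i] < 0) return -1;
        total += len[i];
        if (total > INT_MAX) return -1;  /* caller can raise a graceful error */
    }
    *out = (int)total;
    return 0;
}
```

For the example data, assuming the 52 chr/strand groups are roughly equal in size, the join would produce about 4e6 * 1e6 / 52, or roughly 7.7e10, row pairs. That is far beyond INT_MAX (2147483647), so a check of this kind would fail long before any allocation is attempted.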
_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

