Sure, here's a recap. The short version: the meaning of d[i, j, by = b] is complicated and unintuitive right now because some cases carry a hidden 'by', and the expression becomes much more readable if by-without-by has to be requested explicitly. The longer version follows.
First let's go over what is done currently, in particular what exactly by-without-by is. The following example, adapted from Matthew's examples, illustrates the current behavior:

    > X = data.table(a = c(1,1,2,2,3,3), b = c(1:6), key = "a")
    > Y = data.table(a = c(1,2,1), key = "a")
    > X[Y]
       a b
    1: 1 1
    2: 1 2
    3: 1 1
    4: 1 2
    5: 2 3
    6: 2 4
    > X[Y, sum(b)]
       a V1
    1: 1  3
    2: 1  3
    3: 2  7

What's happening here is that the action j = sum(b) is performed for each row of Y (or rather each 'a'), as if there were a 'by' over the rows of Y. Had Y contained only unique 'a' values, this would have been equivalent to doing a 'by' by 'a' after the merge, but there is a difference when Y$a has duplicates.

This is interesting behavior that can be used in a variety of situations. It also has an interesting leveraging point: if Y$a *is* unique and you'd like to do 'by = a' after the merge, it's computationally cheaper to do the 'by' *during* the merge rather than after. However, it interferes with the naturally established meaning of d[i, j], where for any other kind of 'i' this simply performs action 'j', with no extra hidden 'by'.

The proposal is thus to do the above special 'by' only when explicitly asked to - e.g. by adding a new boolean argument 'each.i', defaulting to FALSE, so that the special 'by' happens only with each.i = TRUE. This would make the syntax much more readable and user-friendly, would eliminate a few FAQ points, and would also allow a new kind of action that afaik is not possible with the current syntax.
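To make the difference with duplicated Y$a concrete, here is a minimal sketch that spells out both groupings by hand. It deliberately avoids X[Y, j] itself, so its result should not depend on whether the hidden 'by' is in effect; the variable names per_row_of_Y and by_a_after_merge are mine, purely for illustration.

```r
library(data.table)

X <- data.table(a = c(1, 1, 2, 2, 3, 3), b = 1:6, key = "a")
Y <- data.table(a = c(1, 2, 1), key = "a")  # keyed, so rows are sorted: 1, 1, 2

# Grouping per row of Y, spelled out by hand: join one row of Y at a time
# and sum 'b' over the rows of X that it matches.
per_row_of_Y <- sapply(seq_len(nrow(Y)), function(k) X[Y[k]][, sum(b)])

# Grouping by 'a' after the merge: duplicated 'a' values in Y collapse
# into a single group.
by_a_after_merge <- X[Y][, sum(b), by = a]

print(per_row_of_Y)
print(by_a_after_merge)
```

The per-row sums come out as 3, 3, 7 (matching the X[Y, sum(b)] output above), while the 'by = a' after the merge gives 6 for a = 1 and 7 for a = 2.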
Here are some correspondences - new syntax on the left, old syntax on the right:

Take 'dt' and apply 'i' (where 'i' is anything, including a join):
    dt[i] <-> dt[i]

Take 'dt', apply 'i' and return 'j' (for any 'i' and 'j') by 'b':
    dt[i, j, by = b] <-> dt[i][, j, by = b] in general; also dt[i, j, by = b] if 'i' is not a join, and dt[i, j, by = b] if 'i' is a join in some cases but not others

Take 'dt', apply 'i' and return 'j', applying cross-apply/by-without-by (the cross-apply happens only when 'i' is a join):
    dt[i, j, each.i = TRUE] <-> dt[i, j]

Take 'dt', apply 'i' and return 'j' over *both* the cross-apply/by-without-by (for 'i' being a join only) and another specified 'by'; think of this as doing by = list(b, rows of Y):
    dt[i, j, by = b, each.i = TRUE] <-> afaik there is no direct correspondence in current behavior

On Tuesday, April 30, 2013, Ricardo Saporta wrote:

> Eddi,
>
> Perhaps you could summarize succinctly, now after a good bit of
> discussion, what your proposed change is.
>
> -Rick
>
>
> On Tue, Apr 30, 2013 at 7:10 PM, statquant3 <[email protected]> wrote:
>
>> Hi, I read the 30 posts and I have to confess that I still do not
>> understand the point of the changes...
>> Could anyone kindly write an example of the current behaviour and what the
>> new option will bring to the table ?
>> Sorry...
>>
>>
>>
>> --
>> View this message in context:
>> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4665873.html
>> Sent from the datatable-help mailing list archive at Nabble.com.
>> _______________________________________________
>> datatable-help mailing list
>> [email protected]
>>
>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
