In retrospect, `.join` is also confusing/untrue (as the data.table join is still being done). I find `cross.apply` clearer.
Arun

On Thursday, May 2, 2013 at 12:33 AM, Arunkumar Srinivasan wrote:
> Eduard,
>
> Yes, that clears it up. If `.join` is FALSE, then there's no `by-without-by`,
> basically. `drop` really serves another purpose.
>
> Once again, I find `each.i = TRUE/FALSE` to be confusing (as it was one of
> the intended purposes of this post to begin with) to mean to apply to *any*
> `i` operation. Unless this is true, I'd like to stick to `.join` as it's what
> we are setting to FALSE/TRUE here.
>
> Thanks for the patient clarifications.
>
> Arun
>
>
> On Thursday, May 2, 2013 at 12:28 AM, Eduard Antonyan wrote:
>
> > Arun, from my previous email:
> >
> > "Take 'dt' and apply 'i' and return 'j' (for any 'i' and 'j') by 'b':
> > dt[i, j, by = b] <-> dt[i][, j, by = b] in general, but also dt[i, j, by
> > = b] if 'i' is not a join, and can also be dt[i, j, by = b] if 'i' is a
> > join in some cases but not others
> >
> > Take 'dt' and apply 'i' and return j, applying cross-apply/by-without-by
> > (will do cross-apply only when 'i' is a join):
> > dt[i, j, each.i = TRUE] <-> dt[i, j]"
> >
> > Together with the default being each.i=FALSE, you can see that the answer
> > to your question will be:
> >
> > DT1[DT2, sum(y), each.i = FALSE, allow.cartesian = TRUE] <-> DT1[DT2,
> > allow.cartesian=TRUE][, sum(y)], i.e.
> > [1] 21
> >
> > and
> > DT1[DT2, sum(y), each.i = TRUE, allow.cartesian = TRUE] <-> DT1[DT2,
> > sum(y), allow.cartesian=TRUE], i.e.
> >    x V1
> > 1: 1  6
> > 2: 2  9
> > 3: 1  6
> >
> >
> >
> > On Wed, May 1, 2013 at 5:23 PM, Arunkumar Srinivasan <[email protected]> wrote:
> > > eddi,
> > >
> > > sorry again, I am a bit confused now.
> > >
> > > DT1 <- data.table(x=c(1,1,1,2,2), y=1:5)
> > > DT2 <- data.table(x=c(1,2,1))
> > > setkey(DT1, "x")
> > >
> > > What's the intended result for `DT1[DT2, sum(y), allow.cartesian = TRUE,
> > > .join = FALSE]`? c(6,9,6) or 21?
> > >
> > >
> > > Arun
> > >
> > >
> > > On Thursday, May 2, 2013 at 12:20 AM, Arunkumar Srinivasan wrote:
> > >
> > > > Sorry the proposed result was a wrong paste in the last message:
> > > >
> > > > # proposed way and the result:
> > > > DT1[DT2, sum(y), .join = FALSE]
> > > > [1] 6 9 6
> > > >
> > > >
> > > > And the last part that it *should* be a data.table is quite obvious
> > > > then.
> > > >
> > > > Arun
> > > >
> > > >
> > > > On Thursday, May 2, 2013 at 12:16 AM, Arunkumar Srinivasan wrote:
> > > >
> > > > > Eduard,
> > > > >
> > > > > Great. That explains the difference between `drop` and `.join`
> > > > > to me.
> > > > > I don't *need* this feature (I can't recall the last time
> > > > > I used a `data.table` for `i` and had to reduce the function,
> > > > > say, sum). But I think it can only better the usage.
> > > > >
> > > > > However, there's one point where *I think* I would still disagree
> > > > > with @eddi here, not sure.
> > > > >
> > > > > DT1 <- data.table(x=c(1,1,1,2,2), y=1:5)
> > > > > DT2 <- data.table(x=c(1,2,1))
> > > > > setkey(DT1, "x")
> > > > >
> > > > > # proposed way and the result:
> > > > > DT1[DT2, sum(y), .join = FALSE]
> > > > > [1] 21
> > > > >
> > > > >
> > > > > So far nice. However, the operation `DT1[DT2, sum(y), .join = TRUE]`
> > > > > *should* result in a `data.table` output as follows (it's even
> > > > > clearer now that .join is set to TRUE, meaning it's a data.table
> > > > > join):
> > > > >
> > > > >    x V1
> > > > > 1: 1  6
> > > > > 2: 2  9
> > > > > 3: 1  6
> > > > >
> > > > >
> > > > > Basically, `.join = TRUE` is the current functionality unchanged and
> > > > > nice to be default (as Matthew hinted).
> > > > >
> > > > > Arun
> > > > >
> > > > >
> > > > > On Tuesday, April 30, 2013 at 5:03 PM, Eduard Antonyan wrote:
> > > > >
> > > > > > Arun,
> > > > > >
> > > > > > Yes, DT1[DT2, y, .JOIN = FALSE] would do the same as DT1[DT2][, y]
> > > > > > does currently.
> > > > > >
> > > > > > No, DT1[DT2, y, .JOIN=FALSE], will NOT do a by-without-by, which is
> > > > > > literally a 'by' by each of the rows of DT2 that are in the join
> > > > > > (thus each.i! - the operation 'y' will be performed for each of the
> > > > > > rows of 'i' and then combined and returned). There is no efficiency
> > > > > > issue here that I can see, but Matthew can correct me on this. As
> > > > > > far as I understand the efficiency comes into play when e.g. the
> > > > > > rows of 'i' are unique, and after the join you'd like to do a 'by'
> > > > > > by those, then DT1[DT2][, j, by = key(DT1)] would be less efficient
> > > > > > since the 'by' could've already been done while joining.
> > > > > >
> > > > > > DT1[DT2, .JOIN=FALSE] would be equivalent to both current and
> > > > > > future DT1[DT2] - in this expression there is no by-without-by
> > > > > > happening in either case.
> > > > > >
> > > > > > The purpose of this is NOT for j just being a column or an
> > > > > > expression that gets evaluated into a single column. It applies to
> > > > > > any j. The extra 'by-without-by' column is currently output
> > > > > > independently of how many columns you output in your j-expression,
> > > > > > the behavior is very similar as to when you specify a by=., except
> > > > > > that the 'by' happens by a very special expression, that only
> > > > > > exists when joining two data-tables and that generally doesn't
> > > > > > exist before or after the join.
> > > > > >
> > > > > > Hope this answers your questions.
> > > > > >
> > > > > >
> > > > > > On Tue, Apr 30, 2013 at 8:48 AM, Arunkumar Srinivasan
> > > > > > <[email protected]> wrote:
> > > > > > > Eduard, thanks for your reply. But some things are still unclear
> > > > > > > to me. I'll try to explain them below.
> > > > > > >
> > > > > > > First I prefer .JOIN (or cross.apply) just because `each.i` seems
> > > > > > > general (that it is applicable to *every* i operation, which as
> > > > > > > of now seems untrue). .JOIN is specific to data.table type for
> > > > > > > `i`.
> > > > > > >
> > > > > > > From what I understand from your reply, if (.JOIN = FALSE), then,
> > > > > > >
> > > > > > > DT1[DT2, y, .JOIN = FALSE] <=> DT1[DT2][, y]
> > > > > > >
> > > > > > > Is this right? It's a bit confusing because I think you're okay
> > > > > > > with "by-without-by" and I got the impression from Sadao that he
> > > > > > > finds the syntax of "by-without-by" inaccessible/advanced for
> > > > > > > basic users. So, just to clarify, here the DT1[DT2, y,
> > > > > > > .JOIN=FALSE] will still do the "by-without-by" and then result in
> > > > > > > a "vector", right?
> > > > > > >
> > > > > > > Matthew explains in the current documentation that DT1[DT2][, y]
> > > > > > > would "join" all columns of DT1 and DT2 and then subset. I assume
> > > > > > > the implementation underneath is *not* DT1[DT2][, y] rather the
> > > > > > > result is an efficient equivalence. Then, that of course seems
> > > > > > > alright to me.
> > > > > > >
> > > > > > > If what I've said so far is right, then the syntax `DT1[DT2,
> > > > > > > .JOIN=FALSE]` doesn't make sense/has no purpose to me. At least I
> > > > > > > can't think of any at the moment.
> > > > > > >
> > > > > > > To conclude, IMHO, if the purpose of `.JOIN` is to provide the
> > > > > > > same as DT1[i, j] for DT1[DT2, j] (j being a column or an
> > > > > > > expression that results in getting evaluated as a scalar for
> > > > > > > every group in the current by-without-by syntax), then, I find
> > > > > > > this is covered in `drop = TRUE/FALSE`. Correct me if I am wrong.
> > > > > > > But, one could do: `DT1[DT2, j, drop=TRUE]` instead of `DT1[DT2,
> > > > > > > j, .JOIN=FALSE]` and DT1[i, j, drop=FALSE] instead of DT1[i,
> > > > > > > list(x,y)].
> > > > > > >
> > > > > > > If you/anyone believes it's wrong, I'd be all ears to clarify as
> > > > > > > to what's the purpose of `drop` then (and also how it *doesn't*
> > > > > > > suit here as compared to .JOIN).
> > > > > > >
> > > > > > > Arun
> > > > > > >
> > > > > > >
> > > > > > > On Tuesday, April 30, 2013 at 2:54 PM, Eduard Antonyan wrote:
> > > > > > >
> > > > > > > > Arun,
> > > > > > > >
> > > > > > > > If the new boolean is false, the result would be the same as
> > > > > > > > without it and would be equal to current behavior of d[i][, j].
> > > > > > > > If it's true, it will only have an effect if i is a join (I
> > > > > > > > think each.i= fits slightly better for this description than
> > > > > > > > .join=) - this will replicate current underlying behavior. If
> > > > > > > > you think the cross-apply is something that could work not just
> > > > > > > > for i being a data-table but other things as well, then it
> > > > > > > > would make perfect sense to implement that action too when the
> > > > > > > > bool is true.
> > > > > > > >
> > > > > > > > On Apr 30, 2013, at 2:58 AM, Arunkumar Srinivasan
> > > > > > > > <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > (The earlier message was too long and was rejected.)
> > > > > > > > > So, from the discussion so far, I see that Matthew is nice
> > > > > > > > > enough to implement `.JOIN` or `cross.apply`. I've a couple
> > > > > > > > > of questions. Suppose,
> > > > > > > > >
> > > > > > > > > DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10)
> > > > > > > > > setkey(DT1, "x")
> > > > > > > > > DT2 <- data.table(x=1)
> > > > > > > > > DT1[DT2, y, .JOIN=TRUE] # I guess the syntax is something like this.
> > > > > > > > > # I expect here the same output as current DT1[DT2, y]
> > > > > > > > >
> > > > > > > > > The above syntax seems "okay". But my first question is what
> > > > > > > > > is `.JOIN=FALSE` supposed to do under these two
> > > > > > > > > circumstances? Suppose,
> > > > > > > > >
> > > > > > > > > DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10)
> > > > > > > > > setkey(DT1, "x")
> > > > > > > > > DT2 <- data.table(x=c(1,2,1), w=c(11:13))
> > > > > > > > > # what's the output supposed to be for?
> > > > > > > > > DT1[DT2, y, .JOIN=FALSE]
> > > > > > > > > DT1[DT2, .JOIN = FALSE]
> > > > > > > > >
> > > > > > > > > Depending on this I'd have to think about `drop =
> > > > > > > > > TRUE/FALSE`. Also, how does it work with `subset`?
> > > > > > > > >
> > > > > > > > > DT1[x %in% c(1,2,1), y, .JOIN=TRUE] # .JOIN is ignored?
> > > > > > > > >
> > > > > > > > > Is this supposed to also do a "cross-apply" on the logical
> > > > > > > > > subset? I guess not. So, .JOIN is an "extra" parameter that
> > > > > > > > > comes into play *only* when `i` is a `data.table`?
> > > > > > > > >
> > > > > > > > > I'd love to have some replies to these questions for me to
> > > > > > > > > take a stance on `.JOIN`. Thank you.
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Arun.
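
For reference, the two readings debated in this thread (21 versus 6, 9, 6) can be spelled out without any new argument. This is only a minimal sketch in base R, deliberately not using data.table: `each.i` and `.join` are proposed names in this discussion, not arguments assumed to exist, and the names `rows_for`, `joined`, `total` and `per_row` are made up for illustration.

```r
# Same toy data as in the thread, as plain data frames (base R only).
DT1 <- data.frame(x = c(1, 1, 1, 2, 2), y = 1:5)
DT2 <- data.frame(x = c(1, 2, 1))

# Hypothetical helper: the rows of DT1 matching one key value taken from DT2.
rows_for <- function(k) DT1[DT1$x == k, , drop = FALSE]

# Reading 1: join first, then evaluate j once over all joined rows,
# i.e. the DT1[DT2][, sum(y)] interpretation.
joined <- do.call(rbind, lapply(DT2$x, rows_for))
total <- sum(joined$y)

# Reading 2: evaluate j once per row of i (the "by-without-by" interpretation).
per_row <- data.frame(
  x  = DT2$x,
  V1 = vapply(DT2$x, function(k) sum(rows_for(k)$y), integer(1))
)

print(total)
print(per_row)
```

With this data, `total` is 21 and `per_row$V1` is 6, 9, 6, matching the two candidate answers quoted above. Whichever name the switch ends up with, the question is only which of these two results `DT1[DT2, sum(y)]` should return by default.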
_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
