Re: [datatable-help] changing data.table by-without-by syntax to require a "by"

Arunkumar Srinivasan Wed, 01 May 2013 16:18:55 -0700

Eduard,  

What do you mean here: `at least when by is not there`. The "cross.apply" or 
".join" or "each.i" was supposedly an option when "i" argument is a 
`data.table`, right? I can find a reason why there would be a `by` there… (I 
mean an explicit by). Do you mean the implicit by when it's true? if not, could 
you elaborate (maybe with an example)?



Arun


On Thursday, May 2, 2013 at 12:47 AM, Eduard Antonyan wrote:

> yeah, I think cross.apply is pretty clear as well, at least when an extra 
> 'by' is not there, but I like each.i when there is a 'by'. Either way this is 
> a pretty small consideration for me and I'd be perfectly happy with either.
>  
>  
> On Wed, May 1, 2013 at 5:36 PM, Arunkumar Srinivasan <[email protected] 
> (mailto:[email protected])> wrote:
> > In retrospect, `.join` is also confusing/untrue (as the data.table join is 
> > still being done). I find `cross.apply` clearer.  
> >  
> > Arun
> >  
> >  
> > On Thursday, May 2, 2013 at 12:33 AM, Arunkumar Srinivasan wrote:
> >  
> >  
> > > Eduard,  
> > >  
> > > Yes, that clears it up. If `.join` if FALSE, then there's no 
> > > `by-without-by`, basically. `drop` really serves another purpose.
> > >  
> > > Once again, I find `each.i = TRUE/FALSE` to be confusing (as it was one 
> > > of the intended purposes of this post to begin with) to mean to apply to 
> > > *any* `i` operation. Unless this is true, I'd like to stick to `.join` as 
> > > it's what we are setting to FALSE/TRUE here.  
> > >  
> > > Thanks for the patient clarifications.
> > >  
> > > Arun
> > >  
> > >  
> > > On Thursday, May 2, 2013 at 12:28 AM, Eduard Antonyan wrote:
> > >  
> > > > Arun, from my previous email:
> > > >  
> > > > "Take 'dt' and apply 'i' and return 'j' (for any 'i' and 'j') by 'b':  
> > > >   dt[i, j, by = b] <-> dt[i][, j, by = b] in general, but also dt[i, j, 
> > > > by = b] if 'i' is not a join, and can also be dt[i, j, by = b] if 'i' 
> > > > is a join in some cases but not others
> > > >  
> > > > Take 'dt' and apply 'i' and return j, applying 
> > > > cross-apply/by-without-by (will do cross-apply only when 'i' is a 
> > > > join):  
> > > >   dt[i, j, each.i = TRUE] <-> dt[i, j]"
> > > >  
> > > > Together with the default being each.i=FALSE, you can see that the 
> > > > answer to your question will be:
> > > >  
> > > > DT1[DT2, sum(y), each.i = FALSE, allow.cartesian = TRUE] <-> DT1[DT2, 
> > > > allow.cartesian=TRUE][, sum(y)], i.e.
> > > > [1] 21
> > > >  
> > > > and
> > > > DT1[DT2, sum(y), each.i = TRUE, allow.cartesian = TRUE] <-> DT1[DT2, 
> > > > sum(y), allow.cartesian=TRUE], i.e.
> > > >    x V1
> > > > 1: 1  6
> > > > 2: 2  9
> > > > 3: 1  6
> > > >  
> > > >  
> > > >  
> > > > On Wed, May 1, 2013 at 5:23 PM, Arunkumar Srinivasan 
> > > > <[email protected] (mailto:[email protected])> wrote:
> > > > > eddi,  
> > > > >  
> > > > > sorry again, I am confused a bit now.  
> > > > >  
> > > > > DT1 <- data.table(x=c(1,1,1,2,2), y=1:5))
> > > > > DT2 <- data.table(x=c(1,2,1))
> > > > > setkey(DT1, "x")
> > > > >  
> > > > > What's the intended result for `DT1[DT2, sum(y), allow.cartesian = 
> > > > > TRUE, .join = FALSE]` ? c(6,9,6) or 21?
> > > > >  
> > > > >  
> > > > > Arun
> > > > >  
> > > > >  
> > > > > On Thursday, May 2, 2013 at 12:20 AM, Arunkumar Srinivasan wrote:
> > > > >  
> > > > > > Sorry the proposed result was a wrong paste in the last message:  
> > > > > >  
> > > > > >     # proposed way and the result:
> > > > > >     DT1[DT2, sum(y), .join = FALSE]
> > > > > >     [1] 6 9 6
> > > > > >  
> > > > > >  
> > > > > > And the last part that it *should* be a data.table is quite obvious 
> > > > > > then.
> > > > > >  
> > > > > > Arun
> > > > > >  
> > > > > >  
> > > > > > On Thursday, May 2, 2013 at 12:16 AM, Arunkumar Srinivasan wrote:
> > > > > >  
> > > > > > > Eduard,  
> > > > > > >  
> > > > > > > Great. That explains me the difference between `drop` and `.join` 
> > > > > > > here.  
> > > > > > > Even though I don't *need* this feature (I can't recall the last 
> > > > > > > time when I use a `data.table` for `i` and had to reduce the 
> > > > > > > function, say, sum). But, I think it can only better the usage.
> > > > > > >  
> > > > > > > However, there's one point *I think* would still disagree with 
> > > > > > > @eddi here, not sure.  
> > > > > > >  
> > > > > > >     DT1 <- data.table(x=c(1,1,1,2,2), y=1:5)
> > > > > > >     DT2 <- data.table(x=c(1,2,1))
> > > > > > >     setkey(DT1, "x")
> > > > > > >  
> > > > > > >     # proposed way and the result:
> > > > > > >     DT1[DT2, sum(y), .join = FALSE]
> > > > > > >     [1] 21
> > > > > > >  
> > > > > > >  
> > > > > > > So far nice. However, the operation `DT1[DT2, sum(y), .join = 
> > > > > > > TRUE]` *should* result in a `data.table` output as follows (it's 
> > > > > > > even more clearer now that .join is set to TRUE, meaning it's a 
> > > > > > > data.table join):  
> > > > > > >  
> > > > > > >    x V1
> > > > > > > 1: 1  6
> > > > > > > 2: 2  9
> > > > > > > 3: 1  6
> > > > > > >  
> > > > > > >  
> > > > > > > Basically, `.join = TRUE` is the current functionality unchanged 
> > > > > > > and nice to be default (as Matthew hinted).  
> > > > > > >      
> > > > > > > Arun
> > > > > > >  
> > > > > > >  
> > > > > > > On Tuesday, April 30, 2013 at 5:03 PM, Eduard Antonyan wrote:
> > > > > > >  
> > > > > > > > Arun,
> > > > > > > >  
> > > > > > > > Yes, DT1[DT2, y, .JOIN = FALSE] would do the same as DT1[DT2][, 
> > > > > > > > y] does currently.
> > > > > > > > No, DT1[DT2, y, .JOIN=FALSE], will NOT do a by-without-by, 
> > > > > > > > which is literally a 'by' by each of the rows of DT2 that are 
> > > > > > > > in the join (thus each.i! - the operation 'y' will be performed 
> > > > > > > > for each of the rows of 'i' and then combined and returned). 
> > > > > > > > There is no efficiency issue here that I can see, but Matthew 
> > > > > > > > can correct me on this. As far as I understand the efficiency 
> > > > > > > > comes into play when e.g. the rows of 'i' are unique, and after 
> > > > > > > > the join you'd like to do a 'by' by those, then DT1[DT2][, j, 
> > > > > > > > by = key(DT1)] would be less efficient since the 'by' could've 
> > > > > > > > already been done while joining.
> > > > > > > >  
> > > > > > > > DT1[DT2, .JOIN=FALSE] would be equivalent to both current and 
> > > > > > > > future DT1[DT2] - in this expression there is no by-without-by 
> > > > > > > > happening in either case.
> > > > > > > >  
> > > > > > > > The purpose of this is NOT for j just being a column or an 
> > > > > > > > expression that gets evaluated into a signal column. It applies 
> > > > > > > > to any j. The extra 'by-without-by' column is currently output 
> > > > > > > > independently of how many columns you output in your 
> > > > > > > > j-expression, the behavior is very similar as to when you 
> > > > > > > > specify a by=., except that the 'by' happens by a very special 
> > > > > > > > expression, that only exists when joining two data-tables and 
> > > > > > > > that generally doesn't exist before or after the join.
> > > > > > > >  
> > > > > > > > Hope this answers your questions.
> > > > > > > >  
> > > > > > > >  
> > > > > > > > On Tue, Apr 30, 2013 at 8:48 AM, Arunkumar Srinivasan 
> > > > > > > > <[email protected] (mailto:[email protected])> wrote:
> > > > > > > > > Eduard, thanks for your reply. But somethings are unclear to 
> > > > > > > > > me still. I'll try to explain them below.  
> > > > > > > > >  
> > > > > > > > > First I prefer .JOIN (or cross.apply) just because `each.i` 
> > > > > > > > > seems general (that it is applicable to *every* i operation, 
> > > > > > > > > which as of now seems untrue). .JOIN is specific to 
> > > > > > > > > data.table type for `i`.  
> > > > > > > > >  
> > > > > > > > > From what I understand from your reply, if (.JOIN = FALSE), 
> > > > > > > > > then,
> > > > > > > > >  
> > > > > > > > >     DT1[DT2, y, .JOIN = FALSE] <=> DT1[DT2][, y]
> > > > > > > > >  
> > > > > > > > > Is this right? It's a bit confusing because I think you're 
> > > > > > > > > okay with "by-without-by" and I got the impression from Sadao 
> > > > > > > > > that he finds the syntax of "by-without-by" 
> > > > > > > > > unaccessible/advanced for basic users. So, just to clarify, 
> > > > > > > > > here the DT1[DT2, y, .JOIN=FALSE] will still do the 
> > > > > > > > > "by-without-by" and then result in a "vector", right?    
> > > > > > > > >  
> > > > > > > > > Matthew explains in the current documentation that DT1[DT2][, 
> > > > > > > > > y] would "join" all columns of DT1 and DT2 and then subset. I 
> > > > > > > > > assume the implementation underneath is *not* DT1[DT2][, y] 
> > > > > > > > > rather the result is an efficient equivalence. Then, that of 
> > > > > > > > > course seems alright to me.  
> > > > > > > > >  
> > > > > > > > > If what I've told so far is right, then the syntax `DT1[DT2, 
> > > > > > > > > .JOIN=FALSE]` doesn't make sense/has no purpose to me. At 
> > > > > > > > > least I can't think of any at the moment.  
> > > > > > > > >  
> > > > > > > > > To conclude, IMHO, if the purpose of `.JOIN` is to provide 
> > > > > > > > > the same as DT1[i, j] for DT1[DT2, j] (j being a column or an 
> > > > > > > > > expression that results in getting evaluated as a scalar for 
> > > > > > > > > every group in the current by-without-by syntax), then, I 
> > > > > > > > > find this is covered in `drop = TRUE/FALSE`. Correct me if I 
> > > > > > > > > am wrong. But, one could do: `DT1[DT2, j, drop=TRUE]` instead 
> > > > > > > > > of `DT1[DT2, j, .JOIN=FALSE]` and DT1[i, j, drop=FALSE] 
> > > > > > > > > instead of DT1[i, list(x,y)].  
> > > > > > > > >  
> > > > > > > > > If you/anyone believes it's wrong, I'd be all ears to clarify 
> > > > > > > > > as to what's the purpose of `drop` then (and also how it 
> > > > > > > > > *doesn't* suit here as compared to .JOIN).
> > > > > > > > >  
> > > > > > > > > Arun
> > > > > > > > >  
> > > > > > > > >  
> > > > > > > > > On Tuesday, April 30, 2013 at 2:54 PM, Eduard Antonyan wrote:
> > > > > > > > >  
> > > > > > > > > > Arun,
> > > > > > > > > >  
> > > > > > > > > > If the new boolean is false, the result would be the same 
> > > > > > > > > > as without it and would be equal to current behavior of 
> > > > > > > > > > d[i][, j]. If it's true, it will only have an effect if i 
> > > > > > > > > > is a join (I think each.i= fits slightly better for this 
> > > > > > > > > > description than .join=) - this will replicate current 
> > > > > > > > > > underlying behavior. If you think the cross-apply is 
> > > > > > > > > > something that could work not just for i being a data-table 
> > > > > > > > > > but other things as well, then it would make perfect sense 
> > > > > > > > > > to implement that action too when the bool is true.  
> > > > > > > > > >  
> > > > > > > > > > On Apr 30, 2013, at 2:58 AM, Arunkumar Srinivasan 
> > > > > > > > > > <[email protected] (mailto:[email protected])> 
> > > > > > > > > > wrote:
> > > > > > > > > >  
> > > > > > > > > > > (The earlier message was too long and was rejected.)  
> > > > > > > > > > > So, from the discussion so far, I see that Matthew is 
> > > > > > > > > > > nice enough to implement `.JOIN` or `cross.apply`. I've a 
> > > > > > > > > > > couple of questions. Suppose,
> > > > > > > > > > >  
> > > > > > > > > > >     DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10)  
> > > > > > > > > > >     setkey(DT1, "x")
> > > > > > > > > > >     DT2 <- data.table(x=1)
> > > > > > > > > > >     DT1[DT2, y, .JOIN=TRUE] # I guess the syntax is 
> > > > > > > > > > > something like this. I expect here the same output as 
> > > > > > > > > > > current DT1[DT2, y]
> > > > > > > > > > >  
> > > > > > > > > > > The above syntax seems "okay". But my first question is 
> > > > > > > > > > > what is `.JOIN=FALSE` supposed to do under these two 
> > > > > > > > > > > circumstances? Suppose,  
> > > > > > > > > > >  
> > > > > > > > > > >     DT1 <- data.table(x=c(1,1,2,3,3), y=1:5, z=6:10)  
> > > > > > > > > > >     setkey(DT1, "x")
> > > > > > > > > > >     DT2 <- data.table(x=c(1,2,1), w=c(11:13))
> > > > > > > > > > >     # what's the output supposed to be for?
> > > > > > > > > > >     DT1[DT2, y, .JOIN=FALSE]
> > > > > > > > > > >     DT1[DT2, .JOIN = FALSE]
> > > > > > > > > > >  
> > > > > > > > > > > Depending on this I'd have to think about `drop = 
> > > > > > > > > > > TRUE/FALSE`. Also, how does it work with `subset`?  
> > > > > > > > > > >  
> > > > > > > > > > >     DT1[x %in% c(1,2,1), y, .JOIN=TRUE] # .JOIN is 
> > > > > > > > > > > ignored?
> > > > > > > > > > >  
> > > > > > > > > > > Is this supposed to also do a "cross-apply" on the 
> > > > > > > > > > > logical subset? I guess not. So, .JOIN is an "extra" 
> > > > > > > > > > > parameter that comes into play *only* when `i` is a 
> > > > > > > > > > > `data.table`?  
> > > > > > > > > > >  
> > > > > > > > > > > I'd love to have some replies to these questions for me 
> > > > > > > > > > > to take a stance on `.JOIN`. Thank you.
> > > > > > > > > > >  
> > > > > > > > > > > Best,
> > > > > > > > > > > Arun.
> > > > > > > > > > >  
> > > > > > > > > > >  
> > > > > > > > > > >  
> > > > > > > > > > >  
> > > > > > > > > > >  
> > > > > > > > > > >  
> > > > > > > > > > >  
> > > > > > > > > >  
> > > > > > > > > >  
> > > > > > > > > >  
> > > > > > > > > >  
> > > > > > > > >  
> > > > > > > > >  
> > > > > > > >  
> > > > > > >  
> > > > > >  
> > > > >  
> > > >  
> > >  
> >  
>

_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

Re: [datatable-help] changing data.table by-without-by syntax to require a "by"

Reply via email to