Hi Yike, In Matthew's email, he writes:
So in other words, when i has no key it's the first column of i to the first column of x's key, the 2nd column of i to the 2nd column of x's key, etc. I think this answers my original question. In the example (which is from the FAQ), i has no key. So it seems the *default* is to match the first column of i to x's key. That's how data.table "knows". It uses the first column as default. At least that's the way I understand things now. Matthew, please correct if I've misunderstood. Regards, Juliet On Thu, Jun 7, 2012 at 12:10 PM, Yike Lu <[email protected]> wrote: > > > On 6/7/2012 11:35 AM, Matthew Dowle wrote: >> >> >> Ah. See previous reply. > > ??? I understand that i doesn't have to be keyed, but how does data.table > know when the names are not even the same? Sorry, I don't see the answer to > that in your reply. > > By equi-join, do you mean it tests for equality on all possible combinations > of columns between [i] and [key(X)]? This seems like it could be rather > dangerous. > > >> I tend to get confused by the words 'left' and >> 'right'. The old SQL syntax *= and =* confused me too. One is left outer >> and the other is right outer but which is which and left and right I >> always had to do do a test example to check for myself. > > I can see where you're coming from. The system I came from had no right > join, although it would have been trivial to define one in terms of the left > join. So I got used to remembering "left as in the left table is the > 'larger' or the one without a proper key". > >> >> When I see X[Y] I just think, well, X subset by Y. I don't think left and >> right. Consider also 3 way joins such as X[Y][Z] vs X[Y[Z]]. X appears >> left of Y and Y left of Z in both cases, but they mean different (and each >> useful) things. > > Yeah, this is where it kind of gets confusing for me. To me, the standard > use of using the left outer is to leave NULL values in order to indicate > missingness in the left (unkeyed) table. There are several possibilities > though. Thinking about it in terms of Y %lj% X <-> X[Y] for a single given > row and as a toy example, a single key, set in both tables: > 1) key_value(X) in key_values(Y) - normal case, normal join > 2) key_value(X) NOT in key_values(Y) - join X gets filled in with nulls in > the join columns > 3) key_value(Y) NOT in key_values(X) - the row in Y gets lost. > > So in fact, X and Y can be supersets, subsets, completely disjoint, or > partially overlapping of each other and there would still be a well defined > action. > > So it seems both our mental models are somewhat wrong - for me, Y isn't > necessarily the larger table, for you Y isn't necessarily the subset. > > It's more accurate to think that Y's rows don't get touched, or that Y is > the table of interest, to which you are adding additional information. > > The reason I like thinking in joins is I prefer infix syntax. More elegant > to me to say setkey(Y, a, b) %lj% setkey(X, a, b) than setkey(X, a, > b)[setkey(Y, a, b)], and the chaining versus nested is easy as well... > > X[Y][Z] => (X %rj% Y) %rj% Z > X[Y[Z]] => X %rj% (Y %rj% Z) > >> >> >> In X[Y] since X is being subset using Y, the result is all X's columns >> followed by the JIS columns from Y (if any). I'm also trying to use the >> names "x" and "i" to refer generically to X[Y] where x=X and i=Y. >> >> merge() is more useful than I've admited in the past. There's a FR to add >> by.x and by.y to that, which'll be the way to join by column name when >> needed. > > I was wondering what the differences/similarities there were to merge(). The > only place I use it is for the alias outer join: %oj% = function(x, y) > merge(x, y) > >> >>> Matt, >>> >>> Two things >>> 1) I noticed that in Juliet's example, Y's first column is not named at >>> all, so defaults to a name of "V1", yet the join carries through. How >>> does the underlying implementation work here? I believe this was the >>> original question, and I'm curious as well now. >>> 2) Having re-read the paragraph you sent, I now realize that X[Y] is >>> equivalent to X right join Y. It may be helpful to explicitly write this >>> in. I certainly had this flipped (I thought it was X left join Y), and >>> the column ordering is flipped from standard SQL. >>> >>> On 6/7/2012 6:00 AM, [email protected] >>> wrote: >>>> >>>> >>>> ---------------------------------------------------------------------- >>>> >>>> Message: 1 >>>> Date: Wed, 6 Jun 2012 13:47:00 -0400 >>>> From: Juliet Hannah<[email protected]> >>>> To: [email protected] >>>> Subject: [datatable-help] join example from faq >>>> Message-ID: >>>> >>>> <CALzuZRQw2QaR=uAfPbmt_R8qnZEhLF7-=_zsoxahxsuhcdc...@mail.gmail.com> >>>> Content-Type: text/plain; charset=ISO-8859-1 >>>> >>>> All, >>>> >>>> I am not understanding a few basic things. I am looking at pg 5 of the >>>> faq. >>>> >>>> X = data.table(grp=c("a","a","b","b","b","c","c"), foo=1:7) >>>> setkey(X,grp) >>>> Y = data.table(c("b","c"), bar=c(4,2)) >>>> X[Y] >>>> >>>> The faq says X[Y] is a join looking up X's rows using Y. >>>> >>>> Does this mean data.table looks up X's key using Y? >>>> >>>> Y has two columns. How does it know to use the first column in this >>>> example? Y's key has not been set. >>>> >>>> Hope my question is not too obvious. :) >>>> >>>> Thanks, >>>> >>>> Juliet >>>> >>>> P.S. Thanks for the wonderful package. I had to do some aggregations >>>> the other day and my other solutions were >>>> running for hours, but data.table finished in a couple of minutes! >>>> >>>> >>>> ------------------------------ >>>> >>>> Message: 2 >>>> Date: Wed, 06 Jun 2012 23:28:08 +0100 >>>> From: Matthew Dowle<[email protected]> >>>> To: Juliet Hannah<[email protected]> >>>> Cc: [email protected] >>>> Subject: Re: [datatable-help] join example from faq >>>> Message-ID:<1339021688.2573.16.camel@netbook> >>>> Content-Type: text/plain; charset="UTF-8" >>>> >>>> Hi, >>>> >>>> Hopefully this paragraph from ?data.table sheds some light : >>>> >>>> "When i is a data.table, x must have a key. i is joined to x using >>>> the >>>> key and the rows in x that match are returned. An equi-join is performed >>>> between each column in i to each column in x's key. The match is a >>>> binary search in compiled C in O(log n) time. If i has less columns than >>>> x's key then many rows of x may match to each row of i. If i has more >>>> columns than x's key, the columns of i not involved in the join are >>>> included in the result. If i also has a key, it is i's key columns that >>>> are used to match to x's key columns and a binary merge of the two >>>> tables is carried out." >>>> >>>> The critical sentence is "If i also has a key ..."; i.e., i doesn't have >>>> to be keyed. Only x must have a key. It's often faster if i is keyed >>>> too, though. There have been some speed improvements in 1.8.1, too. >>>> >>>> In the example you highlighted I think it goes on to show 'join >>>> inherited scope', which is this paragraph : >>>> >>>> "Advanced: In the X[Y,j] form of grouping, the j expression sees >>>> variables in X first, then Y. We call this join inherited scope. If the >>>> variable is not in X or Y then the calling frame is searched, its >>>> calling frame, and so on in the usual way up to and including the global >>>> environment." >>>> >>>> Encouraging to hear you've reduced hours to minutes. At least someone >>>> knows not to use benchmark(...,replications=100) for tasks that take >>>> under 0.01 seconds and then conclude data.table is slow! >>>> >>>> Matthew >>>> >>>> On Wed, 2012-06-06 at 13:47 -0400, Juliet Hannah wrote: >>>>> >>>>> All, >>>>> >>>>> I am not understanding a few basic things. I am looking at pg 5 of the >>>>> faq. >>>>> >>>>> X = data.table(grp=c("a","a","b","b","b","c","c"), foo=1:7) >>>>> setkey(X,grp) >>>>> Y = data.table(c("b","c"), bar=c(4,2)) >>>>> X[Y] >>>>> >>>>> The faq says X[Y] is a join looking up X's rows using Y. >>>>> >>>>> Does this mean data.table looks up X's key using Y? >>>>> >>>>> Y has two columns. How does it know to use the first column in this >>>>> example? Y's key has not been set. >>>>> >>>>> Hope my question is not too obvious. :) >>>>> >>>>> Thanks, >>>>> >>>>> Juliet >>>>> >>>>> P.S. Thanks for the wonderful package. I had to do some aggregations >>>>> the other day and my other solutions were >>>>> running for hours, but data.table finished in a couple of minutes! >>>> >>>> >>>> >>>> >>>> ------------------------------ >>>> >>>> _______________________________________________ >>>> datatable-help mailing list >>>> [email protected] >>>> >>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>>> >>>> End of datatable-help Digest, Vol 28, Issue 2 >>>> ********************************************* >>> >>> >> > _______________________________________________ > datatable-help mailing list > [email protected] > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help _______________________________________________ datatable-help mailing list [email protected] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
