Ah. See previous reply. I tend to get confused by the words 'left' and 'right'. The old SQL syntax *= and =* confused me too. One is left outer and the other is right outer but which is which and left and right I always had to do do a test example to check for myself.
When I see X[Y] I just think, well, X subset by Y. I don't think left and right. Consider also 3 way joins such as X[Y][Z] vs X[Y[Z]]. X appears left of Y and Y left of Z in both cases, but they mean different (and each useful) things. In X[Y] since X is being subset using Y, the result is all X's columns followed by the JIS columns from Y (if any). I'm also trying to use the names "x" and "i" to refer generically to X[Y] where x=X and i=Y. merge() is more useful than I've admited in the past. There's a FR to add by.x and by.y to that, which'll be the way to join by column name when needed. > Matt, > > Two things > 1) I noticed that in Juliet's example, Y's first column is not named at > all, so defaults to a name of "V1", yet the join carries through. How > does the underlying implementation work here? I believe this was the > original question, and I'm curious as well now. > 2) Having re-read the paragraph you sent, I now realize that X[Y] is > equivalent to X right join Y. It may be helpful to explicitly write this > in. I certainly had this flipped (I thought it was X left join Y), and > the column ordering is flipped from standard SQL. > > On 6/7/2012 6:00 AM, [email protected] > wrote: >> >> ---------------------------------------------------------------------- >> >> Message: 1 >> Date: Wed, 6 Jun 2012 13:47:00 -0400 >> From: Juliet Hannah<[email protected]> >> To: [email protected] >> Subject: [datatable-help] join example from faq >> Message-ID: >> <CALzuZRQw2QaR=uAfPbmt_R8qnZEhLF7-=_zsoxahxsuhcdc...@mail.gmail.com> >> Content-Type: text/plain; charset=ISO-8859-1 >> >> All, >> >> I am not understanding a few basic things. I am looking at pg 5 of the >> faq. >> >> X = data.table(grp=c("a","a","b","b","b","c","c"), foo=1:7) >> setkey(X,grp) >> Y = data.table(c("b","c"), bar=c(4,2)) >> X[Y] >> >> The faq says X[Y] is a join looking up X's rows using Y. >> >> Does this mean data.table looks up X's key using Y? >> >> Y has two columns. How does it know to use the first column in this >> example? Y's key has not been set. >> >> Hope my question is not too obvious. :) >> >> Thanks, >> >> Juliet >> >> P.S. Thanks for the wonderful package. I had to do some aggregations >> the other day and my other solutions were >> running for hours, but data.table finished in a couple of minutes! >> >> >> ------------------------------ >> >> Message: 2 >> Date: Wed, 06 Jun 2012 23:28:08 +0100 >> From: Matthew Dowle<[email protected]> >> To: Juliet Hannah<[email protected]> >> Cc: [email protected] >> Subject: Re: [datatable-help] join example from faq >> Message-ID:<1339021688.2573.16.camel@netbook> >> Content-Type: text/plain; charset="UTF-8" >> >> Hi, >> >> Hopefully this paragraph from ?data.table sheds some light : >> >> "When i is a data.table, x must have a key. i is joined to x using >> the >> key and the rows in x that match are returned. An equi-join is performed >> between each column in i to each column in x's key. The match is a >> binary search in compiled C in O(log n) time. If i has less columns than >> x's key then many rows of x may match to each row of i. If i has more >> columns than x's key, the columns of i not involved in the join are >> included in the result. If i also has a key, it is i's key columns that >> are used to match to x's key columns and a binary merge of the two >> tables is carried out." >> >> The critical sentence is "If i also has a key ..."; i.e., i doesn't have >> to be keyed. Only x must have a key. It's often faster if i is keyed >> too, though. There have been some speed improvements in 1.8.1, too. >> >> In the example you highlighted I think it goes on to show 'join >> inherited scope', which is this paragraph : >> >> "Advanced: In the X[Y,j] form of grouping, the j expression sees >> variables in X first, then Y. We call this join inherited scope. If the >> variable is not in X or Y then the calling frame is searched, its >> calling frame, and so on in the usual way up to and including the global >> environment." >> >> Encouraging to hear you've reduced hours to minutes. At least someone >> knows not to use benchmark(...,replications=100) for tasks that take >> under 0.01 seconds and then conclude data.table is slow! >> >> Matthew >> >> On Wed, 2012-06-06 at 13:47 -0400, Juliet Hannah wrote: >>> All, >>> >>> I am not understanding a few basic things. I am looking at pg 5 of the >>> faq. >>> >>> X = data.table(grp=c("a","a","b","b","b","c","c"), foo=1:7) >>> setkey(X,grp) >>> Y = data.table(c("b","c"), bar=c(4,2)) >>> X[Y] >>> >>> The faq says X[Y] is a join looking up X's rows using Y. >>> >>> Does this mean data.table looks up X's key using Y? >>> >>> Y has two columns. How does it know to use the first column in this >>> example? Y's key has not been set. >>> >>> Hope my question is not too obvious. :) >>> >>> Thanks, >>> >>> Juliet >>> >>> P.S. Thanks for the wonderful package. I had to do some aggregations >>> the other day and my other solutions were >>> running for hours, but data.table finished in a couple of minutes! >> >> >> >> ------------------------------ >> >> _______________________________________________ >> datatable-help mailing list >> [email protected] >> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >> >> End of datatable-help Digest, Vol 28, Issue 2 >> ********************************************* > _______________________________________________ datatable-help mailing list [email protected] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
