Gabor, X[Y] and Y[X] are not necessarily the same operations (meaning, they don't *have* to give the same output). However, merge(X,Y) and merge(Y,X) *have* to provide the same output (except for the column order and names). In that sense, a join is a bit different from a merge, no?
Arun On Friday, May 3, 2013 at 5:36 PM, Gabor Grothendieck wrote: > Yes, except that is not really what happens since match() only matches > one row whereas with mult="all", the default, all rows are matched > which is not really matching in the sense of match(). The current > naming confuses matching with joining and its really the latter that > is being done. > > Regarding the existence of merge the advantage of [ is that it will > automatically only take the columns needed so merge is not really > equivalent to [ in all respects. Furthermore having to use different > constructs for different types of merge seems awkward. > > > On Fri, May 3, 2013 at 11:27 AM, Eduard Antonyan > <[email protected] (mailto:[email protected])> wrote: > > Btw the way I think about the "nomatch" name is as follows - normally X[Y] > > tries to match rows of Y with rows of X, and then "nomatch" tells it what to > > do when there is *no match*. > > > > > > On Fri, May 3, 2013 at 10:23 AM, Eduard Antonyan <[email protected] > > (mailto:[email protected])> > > wrote: > > > > > > To clarify - that behavior is already implemented in merge (more > > > specifically merge.data.table). I don't really have a view on having it in > > > X[Y] as well - I don't like all.x and all.y as the names, since there are > > > no > > > params named 'x' and 'y' in [.data.table (as opposed to merge), but some > > > param that would do a full outer join could certainly be added. > > > > > > > > > On Fri, May 3, 2013 at 10:09 AM, Gabor Grothendieck > > > <[email protected] (mailto:[email protected])> wrote: > > > > > > > > Yes, sorry. Its nomatch= which presumably derives from the parameter > > > > of the same name in the match() function. If the idea of the nomatch= > > > > name was to leverage off existing argument names in R then I would > > > > prefer all.y= to be consistent with merge() in place of nomatch= since > > > > we are really merging/joining rather than just matching. That would > > > > also allow extension to all types of join by adding all.an x= argument > > > > too. > > > > > > > > On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan > > > > <[email protected] (mailto:[email protected])> wrote: > > > > > I would prefer nomatch=0 as a default though, simply because that's > > > > > what I > > > > > do most of the time :) > > > > > > > > > > > > > > > On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan > > > > > <[email protected] (mailto:[email protected])> > > > > > wrote: > > > > > > > > > > > > A correction - the param is called "nomatch", not "match". > > > > > > > > > > > > This use case seems like smth a user shouldn't really do - in an > > > > > > ideal > > > > > > world you should have them both keyed by the same-name column. > > > > > > > > > > > > As is, my view on it is that data.table is correcting the user > > > > > > mistake > > > > > > of > > > > > > naming the column in Y - y, instead of x, and so the output makes > > > > > > sense and > > > > > > I don't see the need of complicating the behavior by adding more > > > > > > cases > > > > > > one > > > > > > has to go through to figure out what the output columns would be. > > > > > > Similar to > > > > > > asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous > > > > > > column > > > > > > there, would you? > > > > > > > > > > > > > > > > > > > > > > > > On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck > > > > > > <[email protected] (mailto:[email protected])> wrote: > > > > > > > > > > > > > > I am moving this discussion which started with mdowle to the list. > > > > > > > > > > > > > > Consider this example slightly modified from the data.table FAQ: > > > > > > > > > > > > > > > X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, > > > > > > > > key="x") > > > > > > > > Y = data.table(y=c("b","c","d"), bar=c(4,2,3)) > > > > > > > > out <- X[Y]; out > > > > > > > > > > > > > > > > > > > > > > x foo bar > > > > > > > 1: b 3 4 > > > > > > > 2: b 4 4 > > > > > > > 3: b 5 4 > > > > > > > 4: c 6 2 > > > > > > > 5: c 7 2 > > > > > > > 6: d NA 3 > > > > > > > > > > > > > > Note that the first column of the output is labelled x even though > > > > > > > the > > > > > > > data to produce it comes from y, e.g. "d" in out$x is not in X$x > > > > > > > but > > > > > > > does appear in Y$y so clearly the data is coming from y as > > > > > > > opposed to > > > > > > > x . In terms of SQL the above would be written: > > > > > > > > > > > > > > select Y.y as x, ... > > > > > > > > > > > > > > and the need to renamne the first column of out suggests that > > > > > > > there > > > > > > > may be a deeper problem here. > > > > > > > > > > > > > > Here are some ideas to address this (they would require changes to > > > > > > > data.table): > > > > > > > > > > > > > > - the default of X[Y,, match=NA] would be changed to a default of > > > > > > > X[Y,,match=0] so that it corresponds to the defaults in R's merge > > > > > > > and > > > > > > > in SQL joins. > > > > > > > > > > > > > > - the column name of the first column in the example above would > > > > > > > be > > > > > > > changed to y if match=0 but be left at x if match=NA. In the case > > > > > > > that match=0 (the proposed new default) x and y are equal so the > > > > > > > first > > > > > > > column can be validly labelled as x but in the case that match=NA > > > > > > > they > > > > > > > are not so y would be used as the column name. > > > > > > > > > > > > > > - the name match= does seem a bit misleading since R's match only > > > > > > > matches one item in the target whereas in data.table match matches > > > > > > > many if mult="all" and that is the default. Perhaps some thought > > > > > > > should be given to a name change here? > > > > > > > > > > > > > > The above would seem to correspond more closely to R's merge and > > > > > > > SQL > > > > > > > join defaults. Any use cases or other comments? > > > > > > > > > > > > > > -- > > > > > > > Statistics & Software Consulting > > > > > > > GKX Group, GKX Associates Inc. > > > > > > > tel: 1-877-GKX-GROUP > > > > > > > email: ggrothendieck at gmail.com (http://gmail.com) > > > > > > > _______________________________________________ > > > > > > > datatable-help mailing list > > > > > > > [email protected] > > > > > > > (mailto:[email protected]) > > > > > > > > > > > > > > > > > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > > > > > > > > > > > > > > > > -- > > > > Statistics & Software Consulting > > > > GKX Group, GKX Associates Inc. > > > > tel: 1-877-GKX-GROUP > > > > email: ggrothendieck at gmail.com (http://gmail.com) > > > > > > > > > > > > > > > > > > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com (http://gmail.com) > _______________________________________________ > datatable-help mailing list > [email protected] > (mailto:[email protected]) > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help > >
_______________________________________________ datatable-help mailing list [email protected] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
