Where it says any.y= it should have read all.y=. On Sat, May 4, 2013 at 5:46 AM, Gabor Grothendieck <[email protected]> wrote: > One further comment on nomatch=0 weirdness. It seems that the value > of nomatch= is the row index of the row of X to return if a row in Y > matches no row in X here: X[Y,,nomatch=?] In ordinary R indexing > using an index value of 0 means drop the corresponding component and > NA means return an NA. nomatch=1 would presumably return the first > row of X for non-matching rows of Y but, in fact, nomatch= seems to be > restricted to 0 and NA as any other value generates an error message > to this effect. Likely it was decided that values other than 0 and NA > would be too bizarre and most likely represent user error. If any.y= > were used then it would naturally be logical and this artificial > distinction (i.e .between 0/NA on one hand and everything else on the > other hand) would not have to be made. > > On Fri, May 3, 2013 at 6:41 PM, Gabor Grothendieck > <[email protected]> wrote: >> In thinking about this a bit more I can see the argument for leaving >> the default at nomatch=NA. Consider these examples of indexing: >> >>> letters[27] >> [1] NA >>> BOD[7,] >> Time demand >> NA NA NA >> >> nomatch=NA seems more compatible with these examples than nomatch=0. >> >> (At the same time this does not mean we could not also change the >> argument name from nomatch= to all.y= and add the other merge >> arguments (all.x=, by.x=, by.y=, by=) as well since it remains the >> case that R's merge() seems closer than R's match() to this >> functionality regardless of the default.) >> >> >> On Fri, May 3, 2013 at 4:42 PM, Gabor Grothendieck >> <[email protected]> wrote: >>> One can view data.table's generalization of indexing as the >>> realization that all indexing can conceptually be viewed as merging >>> where indexing with numeric values corresponds to merging with the >>> data.table's row numbers and indexing with logical values, L, is >>> equivalent to merging with which(L) so there are really not two types: >>> indexing and merging but just one type: merging that covers them all. >>> >>> >>> On Fri, May 3, 2013 at 1:01 PM, Arunkumar Srinivasan >>> <[email protected]> wrote: >>>> I am wondering if performing X[Y] as a "merge" in correspondence with R's >>>> base "merge", if the basic idea of "i" becomes confusing. That is, when "i" >>>> is not a data.table in X[i] it indexes by rows. When `i` is a data.table, >>>> instead of the current definition which is in par with the subletting >>>> operation that use `i` (here data.table) as an index to subset X and then >>>> JOIN both X and Y, we say, here X and Y are data.tables and we perform a >>>> merge. I think this becomes confusing regarding the purpose of `i`. >>>> >>>> Remember that the main purpose of having the X[Y] is to have the >>>> flexibility >>>> of using `j` to to filter/subset only the desired columns. So, for example >>>> if you want to get 1 column of Y out of 100 columns when joining, you do: >>>> X[Y, list(cols_of_x, one_col_of_y)] and here, it doesn't go with the >>>> traditional definition of merge. >>>> >>>> As much as I like the idea of having consistent syntax, I also love the >>>> feature of X[Y, j]. So I'm confused as to how to deal with this. >>>> >>>> Arun >>>> >>>> On Friday, May 3, 2013 at 6:54 PM, Gabor Grothendieck wrote: >>>> >>>> I think that from the viewpoint of compatibility and convenience it >>>> would be best to implement all.x and all.y and not rely on swapping X >>>> and Y. SQLite did something like this (they implemented left join but >>>> not right join based on the idea that all you have to do is swap join >>>> arguments) but the problem with it is that it adds a layer of mental >>>> specification effort if the actual problem is better stated in the >>>> unsupported orientation. >>>> >>>> On Fri, May 3, 2013 at 12:49 PM, Eduard Antonyan >>>> <[email protected]> wrote: >>>> >>>> Arun, it only needs the addition of smth like X[Y, keep.all = TRUE], all of >>>> the other merge options already exist as either X[Y] or Y[X] with or >>>> without >>>> nomatch = 0/NA. >>>> >>>> >>>> On Fri, May 3, 2013 at 11:45 AM, Arunkumar Srinivasan >>>> <[email protected]> wrote: >>>> >>>> >>>> Gabor, >>>> >>>> Very true. I suppose your request is that the x[i] where `i` is a >>>> data.table should have the same set of options like R's base `merge` >>>> function, like, by.y=TRUE, by.x=TRUE or all=TRUE. I like the idea by >>>> itself. >>>> However, I am not able to think of a way to do this. I mean, I find the >>>> syntax X[Y, by.x=TRUE] weird / not making sense. That is, to me even though >>>> >>>> X[Y] is equal to Y[X, by.y=TRUE] (or) X[Y, by.x=TRUE] (ignoring the >>>> reordered columns) the latter 2 don't seem to make sense/is redundant >>>> (maybe >>>> it's because I am used to this syntax). >>>> >>>> Arun >>>> >>>> On Friday, May 3, 2013 at 5:57 PM, Gabor Grothendieck wrote: >>>> >>>> In my last post it should have read: >>>> >>>> That X[Y] is not the same as Y[X] is analogous to the fact that >>>> merge(X, Y, all.y=TRUE) is not the same as merge(Y, X, all.y=TRUE) >>>> >>>> On Fri, May 3, 2013 at 11:55 AM, Gabor Grothendieck >>>> <[email protected]> wrote: >>>> >>>> Assuming same-named keys, then these are all the same except possibly >>>> for row and column order: >>>> >>>> X[Y,,nomatch=0] >>>> Y[X,,nomatch=0] >>>> merge(X, Y) >>>> merge(Y, X) >>>> >>>> That X[Y] is not the same as Y[X] is analogous to the fact that >>>> merge(X, Y, all.x=TRUE) is not the same as merge(Y, X, all.x=TRUE) >>>> >>>> On Fri, May 3, 2013 at 11:46 AM, Arunkumar Srinivasan >>>> <[email protected]> wrote: >>>> >>>> Gabor, >>>> >>>> X[Y] and Y[X] are not necessarily the same operations (meaning, they don't >>>> *have* to give the same output). However, merge(X,Y) and merge(Y,X) *have* >>>> to provide the same output (except for the column order and names). In >>>> that >>>> sense, a join is a bit different from a merge, no? >>>> >>>> Arun >>>> >>>> On Friday, May 3, 2013 at 5:36 PM, Gabor Grothendieck wrote: >>>> >>>> Yes, except that is not really what happens since match() only matches >>>> one row whereas with mult="all", the default, all rows are matched >>>> which is not really matching in the sense of match(). The current >>>> naming confuses matching with joining and its really the latter that >>>> is being done. >>>> >>>> Regarding the existence of merge the advantage of [ is that it will >>>> automatically only take the columns needed so merge is not really >>>> equivalent to [ in all respects. Furthermore having to use different >>>> constructs for different types of merge seems awkward. >>>> >>>> >>>> On Fri, May 3, 2013 at 11:27 AM, Eduard Antonyan >>>> <[email protected]> wrote: >>>> >>>> Btw the way I think about the "nomatch" name is as follows - normally X[Y] >>>> tries to match rows of Y with rows of X, and then "nomatch" tells it what >>>> to >>>> do when there is *no match*. >>>> >>>> >>>> On Fri, May 3, 2013 at 10:23 AM, Eduard Antonyan >>>> <[email protected]> >>>> wrote: >>>> >>>> >>>> To clarify - that behavior is already implemented in merge (more >>>> specifically merge.data.table). I don't really have a view on having it in >>>> X[Y] as well - I don't like all.x and all.y as the names, since there are >>>> no >>>> params named 'x' and 'y' in [.data.table (as opposed to merge), but some >>>> param that would do a full outer join could certainly be added. >>>> >>>> >>>> On Fri, May 3, 2013 at 10:09 AM, Gabor Grothendieck >>>> <[email protected]> wrote: >>>> >>>> >>>> Yes, sorry. Its nomatch= which presumably derives from the parameter >>>> of the same name in the match() function. If the idea of the nomatch= >>>> name was to leverage off existing argument names in R then I would >>>> prefer all.y= to be consistent with merge() in place of nomatch= since >>>> we are really merging/joining rather than just matching. That would >>>> also allow extension to all types of join by adding all.an x= argument >>>> too. >>>> >>>> On Fri, May 3, 2013 at 10:59 AM, Eduard Antonyan >>>> <[email protected]> wrote: >>>> >>>> I would prefer nomatch=0 as a default though, simply because that's >>>> what I >>>> do most of the time :) >>>> >>>> >>>> On Fri, May 3, 2013 at 9:57 AM, Eduard Antonyan >>>> <[email protected]> >>>> wrote: >>>> >>>> >>>> A correction - the param is called "nomatch", not "match". >>>> >>>> This use case seems like smth a user shouldn't really do - in an ideal >>>> world you should have them both keyed by the same-name column. >>>> >>>> As is, my view on it is that data.table is correcting the user mistake >>>> of >>>> naming the column in Y - y, instead of x, and so the output makes >>>> sense and >>>> I don't see the need of complicating the behavior by adding more cases >>>> one >>>> has to go through to figure out what the output columns would be. >>>> Similar to >>>> asking for X[J(c("b", "c", "d"))] - you wouldn't want an anonymous >>>> column >>>> there, would you? >>>> >>>> >>>> >>>> On Fri, May 3, 2013 at 6:18 AM, Gabor Grothendieck >>>> <[email protected]> wrote: >>>> >>>> >>>> I am moving this discussion which started with mdowle to the list. >>>> >>>> Consider this example slightly modified from the data.table FAQ: >>>> >>>> X = data.table(x=c("a","a","b","b","b","c","c"), foo=1:7, key="x") >>>> Y = data.table(y=c("b","c","d"), bar=c(4,2,3)) >>>> out <- X[Y]; out >>>> >>>> x foo bar >>>> 1: b 3 4 >>>> 2: b 4 4 >>>> 3: b 5 4 >>>> 4: c 6 2 >>>> 5: c 7 2 >>>> 6: d NA 3 >>>> >>>> Note that the first column of the output is labelled x even though >>>> the >>>> data to produce it comes from y, e.g. "d" in out$x is not in X$x but >>>> does appear in Y$y so clearly the data is coming from y as opposed to >>>> x . In terms of SQL the above would be written: >>>> >>>> select Y.y as x, ... >>>> >>>> and the need to renamne the first column of out suggests that there >>>> may be a deeper problem here. >>>> >>>> Here are some ideas to address this (they would require changes to >>>> data.table): >>>> >>>> - the default of X[Y,, match=NA] would be changed to a default of >>>> X[Y,,match=0] so that it corresponds to the defaults in R's merge and >>>> in SQL joins. >>>> >>>> - the column name of the first column in the example above would be >>>> changed to y if match=0 but be left at x if match=NA. In the case >>>> that match=0 (the proposed new default) x and y are equal so the >>>> first >>>> column can be validly labelled as x but in the case that match=NA >>>> they >>>> are not so y would be used as the column name. >>>> >>>> - the name match= does seem a bit misleading since R's match only >>>> matches one item in the target whereas in data.table match matches >>>> many if mult="all" and that is the default. Perhaps some thought >>>> should be given to a name change here? >>>> >>>> The above would seem to correspond more closely to R's merge and SQL >>>> join defaults. Any use cases or other comments? >>>> >>>> -- >>>> Statistics & Software Consulting >>>> GKX Group, GKX Associates Inc. >>>> tel: 1-877-GKX-GROUP >>>> email: ggrothendieck at gmail.com >>>> _______________________________________________ >>>> datatable-help mailing list >>>> [email protected] >>>> >>>> >>>> >>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>>> >>>> >>>> >>>> >>>> -- >>>> Statistics & Software Consulting >>>> GKX Group, GKX Associates Inc. >>>> tel: 1-877-GKX-GROUP >>>> email: ggrothendieck at gmail.com >>>> >>>> >>>> >>>> >>>> -- >>>> Statistics & Software Consulting >>>> GKX Group, GKX Associates Inc. >>>> tel: 1-877-GKX-GROUP >>>> email: ggrothendieck at gmail.com >>>> _______________________________________________ >>>> datatable-help mailing list >>>> [email protected] >>>> >>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help >>>> >>>> >>>> >>>> >>>> -- >>>> Statistics & Software Consulting >>>> GKX Group, GKX Associates Inc. >>>> tel: 1-877-GKX-GROUP >>>> email: ggrothendieck at gmail.com >>>> >>>> >>>> >>>> >>>> -- >>>> Statistics & Software Consulting >>>> GKX Group, GKX Associates Inc. >>>> tel: 1-877-GKX-GROUP >>>> email: ggrothendieck at gmail.com >>>> >>>> >>>> >>>> >>>> -- >>>> Statistics & Software Consulting >>>> GKX Group, GKX Associates Inc. >>>> tel: 1-877-GKX-GROUP >>>> email: ggrothendieck at gmail.com >>>> >>>> >>> >>> >>> >>> -- >>> Statistics & Software Consulting >>> GKX Group, GKX Associates Inc. >>> tel: 1-877-GKX-GROUP >>> email: ggrothendieck at gmail.com >> >> >> >> -- >> Statistics & Software Consulting >> GKX Group, GKX Associates Inc. >> tel: 1-877-GKX-GROUP >> email: ggrothendieck at gmail.com > > > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com
-- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com _______________________________________________ datatable-help mailing list [email protected] https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
