Re: [datatable-help] datatable-help Digest, Vol 28, Issue 2

Yike Lu Thu, 07 Jun 2012 09:11:03 -0700


On 6/7/2012 11:35 AM, Matthew Dowle wrote:


Ah. See previous reply.

??? I understand that i doesn't have to be keyed, but how doesdata.table know when the names are not even the same? Sorry, I don't seethe answer to that in your reply.

By equi-join, do you mean it tests for equality on all possiblecombinations of columns between [i] and [key(X)]? This seems like itcould be rather dangerous.

I tend to get confused by the words 'left' and
'right'. The old SQL syntax *= and =* confused me too. One is left outer
and the other is right outer but which is which and left and right I
always had to do do a test example to check for myself.

I can see where you're coming from. The system I came from had no rightjoin, although it would have been trivial to define one in terms of theleft join. So I got used to remembering "left as in the left table isthe 'larger' or the one without a proper key".


When I see X[Y] I just think, well, X subset by Y. I don't think left and
right. Consider also 3 way joins such as X[Y][Z] vs X[Y[Z]]. X appears
left of Y and Y left of Z in both cases, but they mean different (and each
useful) things.

Yeah, this is where it kind of gets confusing for me. To me, thestandard use of using the left outer is to leave NULL values in order toindicate missingness in the left (unkeyed) table. There are severalpossibilities though. Thinking about it in terms of Y %lj% X <-> X[Y]for a single given row and as a toy example, a single key, set in bothtables:

1) key_value(X) in key_values(Y) - normal case, normal join

2) key_value(X) NOT in key_values(Y) - join X gets filled in with nullsin the join columns

3) key_value(Y) NOT in key_values(X) - the row  in Y gets lost.

So in fact, X and Y can be supersets, subsets, completely disjoint, orpartially overlapping of each other and there would still be a welldefined action.

So it seems both our mental models are somewhat wrong - for me, Y isn'tnecessarily the larger table, for you Y isn't necessarily the subset.

It's more accurate to think that Y's rows don't get touched, or that Yis the table of interest, to which you are adding additional information.

The reason I like thinking in joins is I prefer infix syntax. Moreelegant to me to say setkey(Y, a, b) %lj% setkey(X, a, b) than setkey(X,a, b)[setkey(Y, a, b)], and the chaining versus nested is easy as well...


X[Y][Z] => (X %rj% Y) %rj% Z
X[Y[Z]] => X %rj% (Y %rj% Z)



In X[Y] since X is being subset using Y, the result is all X's columns
followed by the JIS columns from Y (if any).  I'm also trying to use the
names "x" and "i" to refer generically to X[Y] where x=X and i=Y.

merge() is more useful than I've admited in the past. There's a FR to add
by.x and by.y to that, which'll be the way to join by column name when
needed.

I was wondering what the differences/similarities there were to merge().The only place I use it is for the alias outer join: %oj% = function(x,y) merge(x, y)

Matt,

Two things
1) I noticed that in Juliet's example, Y's first column is not named at
all, so defaults to a name of "V1", yet the join carries through. How
does the underlying implementation work here? I believe this was the
original question, and I'm curious as well now.
2) Having re-read the paragraph you sent, I now realize that X[Y] is
equivalent to X right join Y. It may be helpful to explicitly write this
in. I certainly had this flipped (I thought it was X left join Y), and
the column ordering is flipped from standard SQL.

On 6/7/2012 6:00 AM, [email protected]
wrote:

----------------------------------------------------------------------

Message: 1
Date: Wed, 6 Jun 2012 13:47:00 -0400
From: Juliet Hannah<[email protected]>
To: [email protected]
Subject: [datatable-help] join example from faq
Message-ID:
        <CALzuZRQw2QaR=uAfPbmt_R8qnZEhLF7-=_zsoxahxsuhcdc...@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

All,

I am not understanding a few basic things. I am looking at pg 5 of the
faq.

X = data.table(grp=c("a","a","b","b","b","c","c"), foo=1:7)
setkey(X,grp)
   Y = data.table(c("b","c"), bar=c(4,2))
   X[Y]

The faq says X[Y] is a join looking up X's rows using Y.

Does this mean data.table looks up X's key using Y?

Y has two columns. How does it know to use the first column in this
example? Y's key has not been set.

Hope my question is not too obvious. :)

Thanks,

Juliet

P.S. Thanks for the wonderful package.  I had to do some aggregations
the other day and my other solutions were
running for hours, but data.table finished in a couple of minutes!

------------------------------

Message: 2
Date: Wed, 06 Jun 2012 23:28:08 +0100
From: Matthew Dowle<[email protected]>
To: Juliet Hannah<[email protected]>
Cc: [email protected]
Subject: Re: [datatable-help] join example from faq
Message-ID:<1339021688.2573.16.camel@netbook>
Content-Type: text/plain; charset="UTF-8"

Hi,

Hopefully this paragraph from ?data.table sheds some light :

    "When i is a data.table, x must have a key. i is joined to x using
the
key and the rows in x that match are returned. An equi-join is performed
between each column in i to each column in x's key. The match is a
binary search in compiled C in O(log n) time. If i has less columns than
x's key then many rows of x may match to each row of i. If i has more
columns than x's key, the columns of i not involved in the join are
included in the result. If i also has a key, it is i's key columns that
are used to match to x's key columns and a binary merge of the two
tables is carried out."

The critical sentence is "If i also has a key ..."; i.e., i doesn't have
to be keyed. Only x must have a key. It's often faster if i is keyed
too, though. There have been some speed improvements in 1.8.1, too.

In the example you highlighted I think it goes on to show 'join
inherited scope', which is this paragraph :

    "Advanced: In the X[Y,j] form of grouping, the j expression sees
variables in X first, then Y. We call this join inherited scope. If the
variable is not in X or Y then the calling frame is searched, its
calling frame, and so on in the usual way up to and including the global
environment."

Encouraging to hear you've reduced hours to minutes. At least someone
knows not to use benchmark(...,replications=100) for tasks that take
under 0.01 seconds and then conclude data.table is slow!

Matthew

On Wed, 2012-06-06 at 13:47 -0400, Juliet Hannah wrote:

All,

I am not understanding a few basic things. I am looking at pg 5 of the
faq.

X = data.table(grp=c("a","a","b","b","b","c","c"), foo=1:7)
setkey(X,grp)
   Y = data.table(c("b","c"), bar=c(4,2))
   X[Y]

The faq says X[Y] is a join looking up X's rows using Y.

Does this mean data.table looks up X's key using Y?

Y has two columns. How does it know to use the first column in this
example? Y's key has not been set.

Hope my question is not too obvious. :)

Thanks,

Juliet

P.S. Thanks for the wonderful package.  I had to do some aggregations
the other day and my other solutions were
running for hours, but data.table finished in a couple of minutes!




------------------------------

_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

End of datatable-help Digest, Vol 28, Issue 2
*********************************************

_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

Re: [datatable-help] datatable-help Digest, Vol 28, Issue 2

Reply via email to