On 6/7/2012 11:35 AM, Matthew Dowle wrote:

Ah. See previous reply.
??? I understand that i doesn't have to be keyed, but how does data.table know when the names are not even the same? Sorry, I don't see the answer to that in your reply.

By equi-join, do you mean it tests for equality on all possible combinations of columns between [i] and [key(X)]? This seems like it could be rather dangerous.

I tend to get confused by the words 'left' and
'right'. The old SQL syntax *= and =* confused me too. One is left outer
and the other is right outer but which is which and left and right I
always had to do do a test example to check for myself.
I can see where you're coming from. The system I came from had no right join, although it would have been trivial to define one in terms of the left join. So I got used to remembering "left as in the left table is the 'larger' or the one without a proper key".

When I see X[Y] I just think, well, X subset by Y. I don't think left and
right. Consider also 3 way joins such as X[Y][Z] vs X[Y[Z]]. X appears
left of Y and Y left of Z in both cases, but they mean different (and each
useful) things.
Yeah, this is where it kind of gets confusing for me. To me, the standard use of using the left outer is to leave NULL values in order to indicate missingness in the left (unkeyed) table. There are several possibilities though. Thinking about it in terms of Y %lj% X <-> X[Y] for a single given row and as a toy example, a single key, set in both tables:
1) key_value(X) in key_values(Y) - normal case, normal join
2) key_value(X) NOT in key_values(Y) - join X gets filled in with nulls in the join columns
3) key_value(Y) NOT in key_values(X) - the row  in Y gets lost.

So in fact, X and Y can be supersets, subsets, completely disjoint, or partially overlapping of each other and there would still be a well defined action.

So it seems both our mental models are somewhat wrong - for me, Y isn't necessarily the larger table, for you Y isn't necessarily the subset.

It's more accurate to think that Y's rows don't get touched, or that Y is the table of interest, to which you are adding additional information.

The reason I like thinking in joins is I prefer infix syntax. More elegant to me to say setkey(Y, a, b) %lj% setkey(X, a, b) than setkey(X, a, b)[setkey(Y, a, b)], and the chaining versus nested is easy as well...

X[Y][Z] => (X %rj% Y) %rj% Z
X[Y[Z]] => X %rj% (Y %rj% Z)


In X[Y] since X is being subset using Y, the result is all X's columns
followed by the JIS columns from Y (if any).  I'm also trying to use the
names "x" and "i" to refer generically to X[Y] where x=X and i=Y.

merge() is more useful than I've admited in the past. There's a FR to add
by.x and by.y to that, which'll be the way to join by column name when
needed.
I was wondering what the differences/similarities there were to merge(). The only place I use it is for the alias outer join: %oj% = function(x, y) merge(x, y)

Matt,

Two things
1) I noticed that in Juliet's example, Y's first column is not named at
all, so defaults to a name of "V1", yet the join carries through. How
does the underlying implementation work here? I believe this was the
original question, and I'm curious as well now.
2) Having re-read the paragraph you sent, I now realize that X[Y] is
equivalent to X right join Y. It may be helpful to explicitly write this
in. I certainly had this flipped (I thought it was X left join Y), and
the column ordering is flipped from standard SQL.

On 6/7/2012 6:00 AM, [email protected]
wrote:

----------------------------------------------------------------------

Message: 1
Date: Wed, 6 Jun 2012 13:47:00 -0400
From: Juliet Hannah<[email protected]>
To: [email protected]
Subject: [datatable-help] join example from faq
Message-ID:
        <CALzuZRQw2QaR=uAfPbmt_R8qnZEhLF7-=_zsoxahxsuhcdc...@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

All,

I am not understanding a few basic things. I am looking at pg 5 of the
faq.

X = data.table(grp=c("a","a","b","b","b","c","c"), foo=1:7)
setkey(X,grp)
   Y = data.table(c("b","c"), bar=c(4,2))
   X[Y]

The faq says X[Y] is a join looking up X's rows using Y.

Does this mean data.table looks up X's key using Y?

Y has two columns. How does it know to use the first column in this
example? Y's key has not been set.

Hope my question is not too obvious. :)

Thanks,

Juliet

P.S. Thanks for the wonderful package.  I had to do some aggregations
the other day and my other solutions were
running for hours, but data.table finished in a couple of minutes!


------------------------------

Message: 2
Date: Wed, 06 Jun 2012 23:28:08 +0100
From: Matthew Dowle<[email protected]>
To: Juliet Hannah<[email protected]>
Cc: [email protected]
Subject: Re: [datatable-help] join example from faq
Message-ID:<1339021688.2573.16.camel@netbook>
Content-Type: text/plain; charset="UTF-8"

Hi,

Hopefully this paragraph from ?data.table sheds some light :

    "When i is a data.table, x must have a key. i is joined to x using
the
key and the rows in x that match are returned. An equi-join is performed
between each column in i to each column in x's key. The match is a
binary search in compiled C in O(log n) time. If i has less columns than
x's key then many rows of x may match to each row of i. If i has more
columns than x's key, the columns of i not involved in the join are
included in the result. If i also has a key, it is i's key columns that
are used to match to x's key columns and a binary merge of the two
tables is carried out."

The critical sentence is "If i also has a key ..."; i.e., i doesn't have
to be keyed. Only x must have a key. It's often faster if i is keyed
too, though. There have been some speed improvements in 1.8.1, too.

In the example you highlighted I think it goes on to show 'join
inherited scope', which is this paragraph :

    "Advanced: In the X[Y,j] form of grouping, the j expression sees
variables in X first, then Y. We call this join inherited scope. If the
variable is not in X or Y then the calling frame is searched, its
calling frame, and so on in the usual way up to and including the global
environment."

Encouraging to hear you've reduced hours to minutes. At least someone
knows not to use benchmark(...,replications=100) for tasks that take
under 0.01 seconds and then conclude data.table is slow!

Matthew

On Wed, 2012-06-06 at 13:47 -0400, Juliet Hannah wrote:
All,

I am not understanding a few basic things. I am looking at pg 5 of the
faq.

X = data.table(grp=c("a","a","b","b","b","c","c"), foo=1:7)
setkey(X,grp)
   Y = data.table(c("b","c"), bar=c(4,2))
   X[Y]

The faq says X[Y] is a join looking up X's rows using Y.

Does this mean data.table looks up X's key using Y?

Y has two columns. How does it know to use the first column in this
example? Y's key has not been set.

Hope my question is not too obvious. :)

Thanks,

Juliet

P.S. Thanks for the wonderful package.  I had to do some aggregations
the other day and my other solutions were
running for hours, but data.table finished in a couple of minutes!



------------------------------

_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

End of datatable-help Digest, Vol 28, Issue 2
*********************************************


_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

Reply via email to