Re: [datatable-help] Problem with FAQ 2.8

Matthew Dowle Fri, 07 Jun 2013 05:32:47 -0700

Hi,

Agreed. A change in the software to match the FAQ makes sense. ThatFAQ has non-join columns in mind, for which it is true I believe, butyes it should be true for join columns as well.

The other consideration is rolling joins. With roll=TRUE it is naturalto want to know the staleness of the data joined to. The column namesusually match, say 'date', but the data is different. That was theprimary motivation for i. and x. prefixes:


    X[Y, list(price, daysold = i.date-x.date), roll=TRUE]

i. prefix is already available, but I don't think I did x. yet. Anyway,the 'date' in X 'should' be higher in scope, in compliance with FAQ 2.8,so that this should be the same (although less clear to read since itrelies on the reader knowing FAQ 2.8) :


    X[Y, list(price, daysold = i.date-date), roll=TRUE]

That's less useful now that roll takes a limit, although you stillmight want to know the staleness of data returned within the limit.

I've added a link to this thread to FR 2693 to be addressed. Thanksall.


Matthew


On 07.06.2013 05:50, Gabor Grothendieck wrote:

One correction to my post.  merge() does not include both key columns
in its output; however, that may be less germane because unlike
data.table and SQL one cannot give merge an expression that refers to
them
merge(as.data.frame(d1), as.data.frame(d2), by = 1)
  id1 val val2
1   1   1   11
2   2   2   12
3   2   3   12

The situation with SQLite is as described in my post where both the
id1 and id2 columns are output:
library(sqldf)
sqldf("select * from d1 join d2 on d1.id1 = d2.id2")
  id1 val id2 val2
1   1   1   1   11
2   2   2   2   12
3   2   3   2   12

and one could refer to them as id1 and id2 if they are distinct names
or as d1.id1 and d2.id2 in the select.

One other possibility for data.table would be to change X[Y] so that
in the case of keys with different names both columns appear as inthe
SQL example.  This would presumably also ensure that both could be
referenced in X[Y, j].  however, if the names are the same then there
would be no need to output them both and it would be ok to outputthem
as a single comonly named column.

On Fri, Jun 7, 2013 at 12:34 AM, Gabor Grothendieck
<[email protected]> wrote:
On Thu, Jun 6, 2013 at 11:50 PM, Michael Nelson
<[email protected]> wrote:
This is related to

FR 2693


https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2693&group_id=240&atid=978
What is happening is that the `join` columns must be referencedusing their names as defined in `i` (or Y in X[Y] syntax)
The FAQ doesn't explicitly cover how you are supposed to referencethe columns used in the join.
Perhaps some binding magic could be used to ensure that eithercolumn name could be used. I don't think it is useful want both to bedefined and available as separate objects - -that would mean therewere two copies of something that are identical in value (but notname!)
Note that the FAQ says that the X variables are "in scope".and the
ordinary meaning of being in scope is that such a variable can be
referenced in an unqualified manner so I think it does imply that
these variables can be accessed.
I assume from your response that the answer to my question is thatthe
FAQ is wrong and the behavior is as intended.

If that is the case then it would be desirable that the behavior of
the software be changed to make the FAQ correct.  Having tiny little
exceptions like this that are difficult to remember and error prone
just makes the software harder to use. Another possibility wouldbeto outlaw having keys in X and Y which have different names(although
that would be drastic and inconvenient though safer and easier to
learn then the current situation).

For example, continuing the code in my post here is a second example
consider what would happen if this were to occur:
id1 <- 1
d1[d2, sum(id1 * val)]
   id1 V1
1:   1  1
2:   2  5
3:   4 NA
It would be difficult to realize without close examination thatthereis an error in this code (assuming that the writer intended id1 tobetaken from d1). Here d1$id1 is not in scope (contrary to the FAQ)and
so id1 in the caller is used resulting in wrong output (relative to
the result intended).

Here is another oddity. It seems that in the first case we cannot
access id1 but if we do a join and then access the columns in a
separate [] then we can.

if (exists("id1")) rm(id1)
d1[d2, id1]
Error in `[.data.table`(d1, d2, id1) : object 'id1' not found
d1[d2][, id1]
[1] 1 2 2 4
Note that in R's merge one can refer to both keys and in SQL whenone
does a join one can as well so the behavior we have been discussing
here seems entirely unexpected.

_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

Re: [datatable-help] Problem with FAQ 2.8

Reply via email to