[datatable-help] Adding a column in place and join surprises

David Kulp Wed, 01 Aug 2012 15:05:05 -0700

Suppose I have a data.table with a unique identifier and value and another 
data.table with a cross-reference to that identifier and lots of other 
measurement data.  E.g. suppose my lookup table is "locs" and my measurement 
data is "obsv" as follows:


obsv=data.table(id=1:7, loc=c(10,20,10,10,30,10,20), mvar=rnorm(7), key='id')
locs=data.table(loc=c(30,20,10),name=c("foo","bar","baz"), other=letters[1:3], 
key='loc')

I simply want to add the 'name' column from locs to the obsv table using :=.  
But this quickly becomes really complicated because (1) the keys for the two 
data.tables differ (appropriately), (2) the key for locs is an integer, and (3) 
the return columns of a join always include the matching columns.

First of all, the gotcha is that locs[obsv[,loc]] doesn't work.  This is 
because obsv[,loc] returns a numeric column, which is treated as indexing the 
row numbers.  Surprise!  

> locs[obsv[,loc]] 
     loc name other
[1,]  NA <NA>  <NA>
[2,]  NA <NA>  <NA>
[3,]  NA <NA>  <NA>
[4,]  NA <NA>  <NA>
[5,]  NA <NA>  <NA>
[6,]  NA <NA>  <NA>
[7,]  NA <NA>  <NA>

This unexpected and silent behavior could easily cause very bad results.  (When 
I first did this test I used 1,2 and 3 instead of 10,20 and 30 and everything 
seemed to work!)  I think this inconsistency should be addressed.  For example, 
consider modifying joins so that they only happen when the i argument is a 
data.table or a list.  If it is a character, then it should fail.  Part of the 
problem here is the inconsistency that A[,col1] returns a vector of characters, 
but A[,list(col1,col2)] returns a data.table.  If instead, data.tables were 
always returned unless, say, a simplify=TRUE argument was provided, then we'd 
be in better shape because locs[obsv[,loc]] would always be a join and 
locs[obsv[,loc,simplify=TRUE]] would be a row retrieval as for data.frame.

Anyway, the solution to the above seems to be to create a list object for i:

> locs[list(obsv[,loc])]
Error in `[.data.table`(locs, list(obsv[, loc])) : 
  typeof x.loc (integer) != typeof i.V1 (double)

but that doesn't work because obsv$loc is class numeric and locs$loc is class 
integer.  This is because locs$loc is silently changed to integer when the key 
is set.  So, to perform a lookup we need to coerce to integer as follows:

> locs[list(as.integer(obsv[,loc]))]
     loc name other
[1,]   1  baz     c
[2,]   2  bar     b
[3,]   1  baz     c
[4,]   1  baz     c
[5,]   3  foo     a
[6,]   1  baz     c
[7,]   2  bar     b

But if I want a right-hand-side value for assignment, then I need just a single 
column.  However, there is an inconsistency: a join adds the extra join column 
names even if I specify only the column I want.

> locs[list(as.integer(obsv[,loc])),name]
     loc name
[1,]   1  baz
[2,]   2  bar
[3,]   1  baz
[4,]   1  baz
[5,]   3  foo
[6,]   1  baz
[7,]   2  bar

In my opinion, the j column should always return only the columns that the user 
requested.  If the user wants additional columns (possibly from the join) then 
they can be requested explicitly.  Anyway, the workaround for this is to make a 
chain statement:

> locs[list(as.integer(obsv[,loc]))][,name]
[1] "baz" "bar" "baz" "baz" "foo" "baz" "bar"

That looks good!  Now to assign using := I write

> obsv[,locname:=locs[list(as.integer(obsv[,loc]))][,name]]
     id loc       mvar locname
[1,]  1   1 -1.8721465     baz
[2,]  2   2 -0.5341284     bar
[3,]  3   1 -2.2243473     baz
[4,]  4   1 -0.3093871     baz
[5,]  5   3 -0.7879228     foo
[6,]  6   1 -0.5672113     baz
[7,]  7   2  1.0390285     bar

I would argue that this is surprisingly obtuse and not intuitive.  Things would 
get even more complicated if locs had a 2-column key!  

I suppose one reply is that I should just temporarily set the key for obsv and 
then reassign the entire obsv data.table.  I.e.,

> setkey(obsv,loc)
> obsv=locs[obsv]
> setkey(obsv,id)

This works, but is somehow to my eyes particularly dissatisfying.  Keys must be 
reset twice.  Potentially large datasets must be reassigned in their entirety.  
Another solution that performs in-place assignment is similar:

> setkey(obsv,loc)
> obsv[,locname:=locs[obsv][,name]]
     id loc       mvar locname
[1,]  1   1 -0.6648842     baz
[2,]  3   1 -0.4477593     baz
[3,]  4   1 -1.1300506     baz
[4,]  6   1 -0.3041305     baz
[5,]  2   2 -0.8239177     bar
[6,]  7   2 -0.3416380     bar
[7,]  5   3  1.2745693     foo
> setkey(obsv,id)

This is not so bad, but it would be a lot nicer to not have to set keys and to 
simply say:

> obsv[,locname := locs[obsv,name]]

This could be achieved if (1) joins were performed by matching commonly named 
columns (like an SQL natural join) if the the two tables did not share keys of 
the same cardinality and types and (2) only explicitly listed columns were 
returned.  In my opinion, this idea of "natural joins" based on column names 
would simplify joins a lot, while making them more generally useful and 
intuitive.  If column names differed, then you might specify a list instead, 
e.g.

> A[list(id=B$a_id), val]

or maybe specify the mapping as an optional parameter that could be used if A 
and B did not have common columns and if A and B's keys differed, e.g.

> A[B, val, map=c("id=a_id")]

If joins matched by name, then the implementation could check if the key was 
sufficiently satisfied to be used and otherwise it would just perform a more 
conventional non-key'd join.

_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

[datatable-help] Adding a column in place and join surprises

Reply via email to