Re: [datatable-help] changing data.table by-without-by syntax to require a "by"

Sadao Milberg Fri, 26 Apr 2013 06:34:47 -0700

Your suggestion for transition seems reasonable, although I still think you 
should just use a new argument rather than try to change the behavior of by.  
The most natural thing seems to leave Y as the `i` value, since after all, we 
are still joining on the key, and then just modify the standard join behavior 
with the cross.apply=TRUE or some such.

This way, you avoid having to have a more complicated description of the `by` 
argument, where all of a sudden it means 'group by these expressions, unless 
you use the special expression .XXX, in which case something confusingly 
similar yet different happens, oh, and by the way, you can only use .XXX if you 
are also using i=Y' (and what does by=list(a, .JOIN) do?).  To some extent your 
final proposal of by=Y is a little better, but still confusing since now you're 
using by to join and group, when it's `i`'s job to do that.

Loosely related, what does .JOIN represent?  Is it just a flag, or is it a 
derived variable the way .SD is?  If it's just a flag, it seems like a bad idea 
to use a name to represent it since that is a break from the meaning of all the 
other .X variables in data.table, which actually contain some kind of 
derivative data.

Finally, when you say "might help in a few related areas e.g. X[Y][,j] (which 
isn't great right now, agreed)", do you mean join inherited scope will work 
even when we're not in by-without-by mode?  That would be great.

S.

Date: Fri, 26 Apr 2013 12:14:02 +0100
From: [email protected]
To: [email protected]
CC: [email protected]
Subject: Re: [datatable-help] changing data.table by-without-by syntax to 
require a "by"

I didn't get any feedback off list on this one.

But I'm coming round to the idea.

What about by=.JOIN   (is that what you were thinking .J stood for?)  Other 
possibilities: .EACHI, .IROW, .EACHIROW, .CROSSAPPLY, .EACHJOIN.  Just to 
brainstorm it.

by=.JOIN could be added anyway with no backwards compatibility issues, so that 
those who wished to be explicit now could be.

I'm also coming round to changing the default for X[Y, j].  It might help in 
a few related areas e.g. X[Y][,j] (which isn't great right now, agreed).  We 
have successfully made non-backwards-compatible changes in the past by 
introducing a global option which we slowly migrate to.  If 
datatable.bywithoutby was added it could take values TRUE|"warning"|FALSE from 
day one, with default TRUE.  That allows those who wish for explicit by to 
migrate straight away by changing the default to FALSE.  Existing users could 
set it to "warning" to see how many implicit bywithoutby calls they have.  
Those calls can gradually be changed to by=.JOIN and in that way both implicit 
and explicit work at the same time, for say a year, with full backwards 
compatibility by default.  This approach allows a slow and flexible migration 
path on a per feature basis.  Then the default could be changed to "warning" 
before finally FALSE.  Depending on how it goes, the option could be left 
there to allow TRUE if anyone wanted it, or removed (maybe after two years).  
Similar to the removal of J() outside DT[...] i.e. users can still now very 
easily write J=data.table in their .Rprofile if they wish, for backwards 
compatibility.
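
A rough sketch of how such a three-state option could be consulted, in plain 
R.  The helper name implicit_by_allowed() is invented here for illustration, 
and datatable.bywithoutby is only the option name proposed above; neither is 
claimed to exist in data.table.

```r
# Hypothetical helper mirroring the proposed TRUE | "warning" | FALSE option.
# Returns TRUE when an implicit by-without-by call would still be honoured.
implicit_by_allowed <- function() {
  mode <- getOption("datatable.bywithoutby", default = TRUE)
  if (identical(mode, "warning")) {
    # Transitional setting: keep the old behaviour but flag each use.
    warning("implicit by-without-by used; consider an explicit by")
    return(TRUE)
  }
  # TRUE keeps the old behaviour; anything else (e.g. FALSE) turns it off.
  isTRUE(mode)
}
```

Under this sketch, setting options(datatable.bywithoutby = "warning") would 
let existing code keep running while reporting each implicit call.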

Or ... instead of :

    X[Y, j, by=.JOIN]

what about :

    X[by=Y, j]

Matthew

On 25.04.2013 16:32, Matthew Dowle wrote:

I'd appreciate some input from others whether they agree or not.   If you have 
a view perhaps let me know off list,  or on list, whichever you prefer.

Thanks,

Matthew

On 25.04.2013 13:45, Eduard Antonyan wrote:

Well, so can .I or .N or .GRP or .BY, yet those are used as special names, 
which is exactly why I suggested .J.
The problem with using 'missingness' is that it already means something very 
different when i is not a join/cross: it means *don't* do a by, thus 
introducing the whole case analysis one has to go through in one's head every 
time, as in OP (which of course becomes automatic after a while, but it's a 
cost nonetheless, and a particularly high one for new people). So I see absence 
of 'by' as an already taken and used signal, and thus something else has to be 
used for the new signal of cross apply (it doesn't have to be the specific 
option I mentioned above). This is exactly why I find optional turning off of 
this behavior unsatisfactory, and I don't see that as a solution to this at all. 
I think in the x+y context the appropriate analog is: what if that added x and 
y normally, but when x and y were data.frames it did element by element 
multiplication instead? Yes, that's possible to do, and possible to document, 
but it's not a good idea, because it takes the place of adding them element by 
element. The recycling behavior doesn't do that: what it says is that it 
doesn't really make sense to add them as is, but we can do that after 
recycling, so let's recycle. It doesn't take the place of another existing way 
of adding vectors. 

On Apr 25, 2013, at 4:28 AM, Matthew Dowle <[email protected]> wrote:

I see what you're getting at. But .J may be a column name, which is the current 
meaning of by = single symbol. And why .J?  If not .J, or any single symbol 
what else instead?  A character value such as by="irows" is taken to mean the 
"irows" column currently (for consistency with by="colA,colB,colC").  But some 
signal needs to be passed to by=, then (you're suggesting), to trigger the 
cross apply by each i row.  Currently, that signal is missingness  (which I 
like, rely on, and use with join inherited scope).

As I wrote in the S.O. thread,  I'm happy to make it optional (i.e. an option 
to turn off by-without-by), since there is no downside.   But you've continued 
to argue for a change to the default, iiuc.

Maybe it helps to consider :

     x+y

Fundamentally in R this depends on what x and y are.  Most of us probably 
assume (as a first thought) that x and y are vectors and know that this will 
apply "+" elementwise,  recycling y if necessary.  In R we like and write code 
like this all the time.   I think of X[Y, j] in the same way: j is the 
operation (like +) which is applied for each row of Y.   If you need j for the 
entire set that Y joins to,  then like a FAQ says,  make j missing too and it's 
X[Y][,j]. But providing a way to make X[Y,j] do the same as X[Y][,j] would be 
nice and is on the list:  drop=TRUE would do that (as someone mentioned on the 
S.O. thread).  So maybe the new option would be datatable.drop (but with 
default FALSE not TRUE).  If you wanted to turn off by-without-by you might set 
options(datatable.drop=TRUE). Then you can use data.table how you prefer 
(explicit by) and I can use it how I prefer.
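
For reference, the recycling behaviour of "+" described above can be seen in a 
few lines of base R.  The values are arbitrary and chosen only for 
illustration.

```r
x <- 1:6
y <- c(10, 20)

# y is shorter than x, so it is recycled to 10, 20, 10, 20, 10, 20
# before "+" is applied element by element.
z <- x + y
print(z)  # 11 22 13 24 15 26
```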

I'm happy to add the argument to [.data.table,  and make its default changeable 
via a global option in the usual way. 
Matthew
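
The difference between applying j for each row of Y and applying it to the 
whole joined set can be sketched as below.  This assumes a later data.table 
release in which per-row grouping is requested explicitly with by=.EACHI (one 
of the names brainstormed in this thread, available from 1.9.4 as far as I 
know); on the releases discussed here the first result came from plain 
X[Y, sum(b)].  The tables are made up for illustration.

```r
library(data.table)

X <- data.table(a = rep(1:3, 5), b = 1:15, key = "a")
Y <- data.table(a = 1:2)

# j evaluated once per row of Y: one sum for a == 1, one for a == 2.
per_row <- X[Y, sum(b), by = .EACHI, on = "a"]

# j evaluated once over everything Y joins to: a single total.
whole <- X[Y, on = "a"][, sum(b)]

print(per_row$V1)  # 35 40
print(whole)       # 75
```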

On 25.04.2013 05:16, Eduard Antonyan wrote:

That's really interesting, I can't currently think of another way of doing that 
as after X[Y] is done the necessary information is lost. 
To retain that functionality and achieve better readability, as in OP, I think 
something along the lines of X[Y, head(.SD, i.top), by=.J] would be a good 
replacement for current syntax.
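
A small sketch of the information loss mentioned above, assuming a current 
data.table release and a Y in which the key value 1 appears twice.  Once X[Y] 
has been materialised, grouping by a can no longer tell the two Y rows with 
a == 1 apart, so only the first top value for that key is used.

```r
library(data.table)

X <- data.table(a = rep(1:3, 5), b = 1:15, key = "a")
Y <- data.table(a = c(1L, 2L, 1L), top = c(3L, 4L, 2L))

# 15 rows: the 5 matching rows of X for each of the 3 rows of Y.
joined <- X[Y, on = "a"]

# Grouping by a merges both a == 1 rows of Y into one group of 10 rows,
# so top[1] is 3 for that group and the request for 2 rows is dropped.
chained <- joined[, head(.SD, top[1]), by = a]

print(nrow(joined))   # 15
print(chained$b)      # 1 4 7 2 5 8 11
```

The per-row version in the quoted example returns 9 rows for the same Y; the 
chained form returns 7.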

On Apr 24, 2013, at 6:01 PM, Eduard Antonyan <[email protected]> wrote:

that's an interesting example - I didn't realize current behavior would do 
that, I'm not at a PC anymore but I'll definitely think about it and report 
back, as it's not immediately obvious to me

On Wed, Apr 24, 2013 at 5:50 PM, Matthew Dowle <[email protected]> wrote:

i. prefix is just a robust way to reference join inherited columns:   the 'top' 
column in the i table.   Like table aliases in SQL.

What about this? :

1> X = data.table(a=1:3,b=1:15, key="a")
1> X
    a  b
 1: 1  1
 2: 1  4
 3: 1  7
 4: 1 10
 5: 1 13
 6: 2  2
 7: 2  5
 8: 2  8
 9: 2 11
10: 2 14
11: 3  3
12: 3  6
13: 3  9
14: 3 12
15: 3 15
1> Y = data.table(a=c(1,2,1), top=c(3,4,2))
1> Y
   a top
1: 1   3
2: 2   4
3: 1   2
1> X[Y, head(.SD,i.top)]
   a  b
1: 1  1
2: 1  4
3: 1  7
4: 2  2
5: 2  5
6: 2  8
7: 2 11
8: 1  1
9: 1  4
1> 
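
The transcript above relies on the implicit per-row grouping under discussion. 
A sketch of the same query with the grouping spelled out, assuming a later 
data.table release where by=.EACHI is available (1.9.4 onwards, as far as I 
know).  Here rep(1:3, 5) stands in for the recycled a=1:3, and Y uses integer 
columns as a precaution; both are choices made for this sketch, not part of 
the original example.

```r
library(data.table)

X <- data.table(a = rep(1:3, 5), b = 1:15, key = "a")
Y <- data.table(a = c(1L, 2L, 1L), top = c(3L, 4L, 2L))

# For each row of Y, take the first i.top rows of the matching rows of X.
# i.top refers to the top column of the i table (Y).
res <- X[Y, head(.SD, i.top), by = .EACHI, on = "a"]

print(res$b)  # 1 4 7 2 5 8 11 1 4
```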

On 24.04.2013 23:43, Eduard Antonyan wrote:

I assumed they meant create a table :)
That looks cool, what's i.top?  I can get a result very similar to yours by 
writing:
X[Y][, head(.SD, top[1]), by = a]
and I probably would want the following to produce your result (this might 
depend a little on what exactly i.top is):
X[Y, head(.SD, i.top), by = a]

On Wed, Apr 24, 2013 at 5:28 PM, Matthew Dowle <[email protected]> wrote:

That sentence on that linked webpage seems incorrect English, since table is a 
noun not a verb.  Should "table" be "join" perhaps?

Anyway, by-without-by is often used with join inherited scope (JIS).  For 
example, translating their example :

1> X = data.table(a=1:3,b=1:15, key="a")
1> X
    a  b
 1: 1  1
 2: 1  4
 3: 1  7
 4: 1 10
 5: 1 13
 6: 2  2
 7: 2  5
 8: 2  8
 9: 2 11
10: 2 14
11: 3  3
12: 3  6
13: 3  9
14: 3 12
15: 3 15
1> Y = data.table(a=c(1,2), top=c(3,4))
1> Y
   a top
1: 1   3
2: 2   4
1> X[Y, head(.SD,i.top)]
   a  b
1: 1  1
2: 1  4
3: 1  7
4: 2  2
5: 2  5
6: 2  8
7: 2 11
1> 

If there was no by-without-by (analogous to CROSS BY),  then how would that be 
done?

On 24.04.2013 22:22, Eduard Antonyan wrote:

By that you mean current behavior? You'd get current behavior by explicitly 
specifying the appropriate "by" (i.e. "by" equal to the key).
Btw, I'm trying to understand SQL CROSS APPLY vs JOIN using 
http://explainextended.com/2009/07/16/inner-join-vs-cross-apply/, and I can't 
figure out how by-without-by (or with by-with-by for that matter:) ) helps with 
e.g. the first example there:
"We table table1 and table2. table1 has a column called rowcount.
For each row from table1 we need to select first rowcount rows from table2, 
ordered by table2.id"

On Wed, Apr 24, 2013 at 4:01 PM, Matthew Dowle <[email protected]> wrote:

But then what would be analogous to CROSS APPLY in SQL?

 > I'd agree with Eduard, although it's probably too late to change behavior
 > now.  Maybe for data.table.2?  Eduard's proposal seems more closely
 > aligned with SQL behavior as well (SELECT/JOIN, then GROUP, but only if
 > requested).
 >
 > S.
 >
 >> Date: Mon, 22 Apr 2013 08:17:59 -0700
 >> From: [email protected]
 >> To: [email protected]
 >> Subject: Re: [datatable-help] changing data.table by-without-by
 >> syntax to require a "by"
 >>
 >> I think you're missing the point Michael. Just because it's possible to
 >> do it the way it's done now, doesn't mean that's the best way, as I've
 >> tried to argue in the OP. I don't think you've addressed the issue of
 >> unnecessary complexity pointed out in OP.
 >>
 >> --
 >> View this message in context:
 >> http://r.789695.n4.nabble.com/changing-data-table-by-without-by-syntax-to-require-a-by-tp4664770p4664990.html
 >> Sent from the datatable-help mailing list archive at Nabble.com.

_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
