Re: [datatable-help] Follow-up on subsetting data.table with NAs

Arunkumar Srinivasan Mon, 10 Jun 2013 06:39:40 -0700

Frank, 
You're right about my final point. I can't recollect why I wrote that now. I 
guess the `!` function will be restored automatically.


With my second example, all I wanted to establish was that there is another 
reason to stop `!` from acting as a "Not Join": `DT[!x]` is perfectly valid 
syntax for those who have worked with data.frames and have shifted to 
data.table, but it will not perform the intended action because it is treated 
as a Not Join. In addition, `DT[!x]` gives an error when column "x" has NA. 
This was meant to be an additional argument for not using `!` for Not Join, 
but it has caused more confusion. Let's forget about my examples :).
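
For reference, a minimal base R sketch of the data.frame reading of `!x` 
mentioned above. It uses no data.table code at all, and the column values are 
made up for illustration; it only shows what someone coming from data.frames 
would expect `!` to mean.

```r
# Base R only; the data below is invented for illustration.
DF <- data.frame(x = c(0, 5, NA, 10, 0), y = 1:5)

# `!` coerces the numeric column to logical first:
# 0 becomes TRUE, non-zero becomes FALSE, NA stays NA.
not_x <- !DF$x

# Logical subsetting of a data.frame keeps an all-NA row for each NA index.
by_not <- DF[not_x, ]
by_eq  <- DF[DF$x == 0, ]

# In base R, !(x == 0) and x != 0 agree element-wise, NA included.
same_logic <- identical(!(DF$x == 0), DF$x != 0)
```

Here `by_not` and `by_eq` select the same rows, including one row of NAs for 
the missing value in `x`.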

To conclude, "~" or "NJ" makes more sense than `!` for "Not Join", and of 
course the function of `!` will then be automatically restored to "not" 
(preferably also with an na.rm = TRUE/FALSE option). This is what I intended 
to say in the original discussion. Sorry for any confusion. 
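
A rough sketch of what the na.rm = TRUE/FALSE idea could look like, written 
in base R. The function `subset_rows` and its `na.rm` argument are 
hypothetical names invented for this illustration; nothing here is 
data.table's actual interface.

```r
# Hypothetical helper, not part of data.table: the name `subset_rows` and
# the `na.rm` argument are assumptions made for this sketch.
subset_rows <- function(df, keep, na.rm = TRUE) {
  stopifnot(is.logical(keep), length(keep) == nrow(df))
  if (na.rm) {
    # which() returns only the positions that are TRUE, so NA is dropped.
    df[which(keep), , drop = FALSE]
  } else {
    # Plain logical indexing keeps a row of NAs for every NA in `keep`.
    df[keep, , drop = FALSE]
  }
}

DF <- data.frame(x = c(0, 5, NA, 10, 0), y = 1:5)

dropped <- subset_rows(DF, DF$x == 0)                 # like subset(DF, x == 0)
kept    <- subset_rows(DF, DF$x == 0, na.rm = FALSE)  # like DF[DF$x == 0, ]
```

With na.rm = TRUE the NA row is dropped, as `subset` does; with 
na.rm = FALSE the result matches ordinary data.frame subsetting.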

Arun


On Monday, June 10, 2013 at 3:20 PM, Frank Erickson wrote:

> +1 to using ~ for the not-join/join on complement/complement then join. 
> Having some logical-looking i's lead to subsetting and others to not-joins 
> can (for me) lead to mistakes that I'm not likely to catch until much later, 
> if at all.
> 
> I'm not sure I follow Arun's second example. If the syntax is changed so that 
> ~ works as ! does now, then presumably !x will be reverted to having only a 
> logical interpretation -- coercing x to logical and taking the subset where x 
> == 0 -- which is the behavior you want. So why is it a separate issue? The 
> remaining difference from data.frames would be that DF[!x] would show NA 
> rows, if any, while DT[!x] would not. 
> 
> --Frank
> 
> 
> On Mon, Jun 10, 2013 at 4:21 AM, Arunkumar Srinivasan <[email protected]> 
> wrote:
> > Matthew, 
> > 
> > > How about ~ instead of ! ?      I ruled out - previously to leave + and - 
> > > available for future use.  NJ() may be possible too. 
> > Both "NJ()" and "~" are okay for me.
> > 
> > > That result makes perfect sense to me.   I don't think of !(x==.) being 
> > > the same as  x!=.    ! is simply a prefix.    It's all the rows that 
> > > aren't returned if the ! prefix wasn't there.
> > 
> > I understand that `DT[!(x)]` does what `data.table` is designed to do 
> > currently. What I failed to mention was that if one were to consider 
> > implementing `!(x==.)` as the same as `x != .` then this behaviour has to 
> > be changed. Let's forget this point for a moment.
> > 
> > > That needs to be fixed.  But we're getting quite theoretical here and far 
> > > away from common use cases.  Why would we ever have row numbers of the 
> > > table, as a column of the table itself and want to select the rows by 
> > > number not mentioned in that column?
> > > 
> > 
> > 
> > Probably I did not choose a good example. Suppose that I've a data.table 
> > and I want to get all rows where "x == 0". Let's say:
> > 
> > library(data.table)
> > set.seed(45)
> > # y has 10 values so that both columns have the same length
> > DT <- data.table(x = sample(c(0, 5, 10, 15), 10, replace = TRUE),
> >                  y = sample(10))
> > 
> > DF <- as.data.frame(DT)
> > 
> > 
> > 
> > To get all rows where x == 0, it could be done with DT[x == 0]. But it 
> > makes sense, at least in the context of data.frames, to do equivalently,
> > 
> > DF[!(DF$x), ] (or) DF[DF$x == 0, ]
> > 
> > All I want to say is that I expect `DT[!(x)]` to give the same result as 
> > `DT[x == 0]` (even though I fully understand it's not the intended 
> > behaviour of data.table), as it's more intuitive and less confusing.  
> > 
> > So, changing `!` to `~` or `NJ` is one half of the issue for me. The other 
> > is to restore the usual function of `!` in all contexts. I hope I have 
> > conveyed what I wanted to say better this time. 
> > 
> > Best,
> > 
> > Arun
> > 
> > 
> > 
> > 
> > On Monday, June 10, 2013 at 10:52 AM, Matthew Dowle wrote:
> > 
> > >  
> > > Hi,
> > > How about ~ instead of ! ?      I ruled out - previously to leave + and - 
> > > available for future use.  NJ() may be possible too.
> > > Matthew
> > >  
> > > On 10.06.2013 09:35, Arunkumar Srinivasan wrote:
> > > > Hi Matthew,
> > > > My view (from the last reply) more or less reflects mnel's comments 
> > > > here: 
> > > > http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently#comment23317096_16240143
> > > >  
> > > > Pasted here for convenience:
> > > > data.table is mimicking subset in its handling of NA values in logical i 
> > > > arguments. -- the only issue is the ! prefix signifying a not-join, not 
> > > > the way one might expect. Perhaps the not join prefix could have been 
> > > > NJ not ! to avoid this confusion -- this might be another discussion to 
> > > > have on the mailing list -- (I think it is a discussion worth having) 
> > > > 
> > > > Arun 
> > > > 
> > > > On Monday, June 10, 2013 at 10:28 AM, Arunkumar Srinivasan wrote:
> > > > 
> > > > > > Hm, good point.  Is data.table consistent with SQL already, for 
> > > > > > both == and !=, and so no change needed?  
> > > > > > 
> > > > > > 
> > > > > 
> > > > > Yes, I believe it's already consistent with SQL. However, the 
> > > > > documentation's current description of NA being treated as FALSE is 
> > > > > not needed, and not quite true, imho (please see below).
> > > > >  
> > > > > > And it was correct for Frank to be mistaken.  
> > > > > > 
> > > > > > 
> > > > > 
> > > > > Yes, it seems like he was mistaken.
> > > > > > Maybe just some more documentation and examples needed then.
> > > > > > 
> > > > > > 
> > > > > 
> > > > > It'd be much more appropriate if the documentation said that 
> > > > > subsetting in data.table mimics the "subset" function (in order to 
> > > > > be in line with SQL) by dropping rows where the logical evaluates 
> > > > > to NA. In the code I pasted a couple of posts back, replacing NAs 
> > > > > with FALSE was not necessary: `irows <- which(i)` makes clear that 
> > > > > `which` is used to get the indices before subsetting, and `which` 
> > > > > already drops NAs. This fits perfectly well with the interpretation 
> > > > > of NA in data.table. 
> > > > > > Are you happy that DT[!(x==.)] and DT[x!=.] do treat NA 
> > > > > > inconsistently? :
> > > > > > http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently
> > > > > > 
> > > > > > 
> > > > > 
> > > > >  Ha, I like the idea behind the use of () in evaluating expressions. 
> > > > > It's another nice layer towards simplicity in data.table. But I still 
> > > > > think equivalent logical operations should not give different 
> > > > > results. If !(x == .) and x != . are indeed meant to be different, 
> > > > > then I'd suggest replacing `!` with a more appropriate name, as it's 
> > > > > much too easy to get confused otherwise. 
> > > > > In essence, either !(x == .) must evaluate to (x != .) if the 
> > > > > underlying meaning of these is the same, or the `!` in `!(x==.)` 
> > > > > must be replaced with something more appropriate for what it's 
> > > > > supposed to do. Personally, I prefer the former. It would greatly 
> > > > > tighten the structure and consistency.
> > > > > > "na.rm = TRUE/FALSE" sounds good to me.  I'd only considered 
> > > > > > nomatch before in the context of joins, not logical subsets.
> > > > > > 
> > > > > > 
> > > > > 
> > > > > Yes, I think this option would give more control when evaluating 
> > > > > expressions in `i`, by providing both "subset"-style behaviour 
> > > > > (default) and the typical data.frame subsetting (na.rm = FALSE).
> > > > > Best regards,
> > > > >  
> > > > > Arun
> > 
> > _______________________________________________
> > datatable-help mailing list
> > [email protected] 
> > (mailto:[email protected])
> > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> 
