Re: [datatable-help] Follow-up on subsetting data.table with NAs

Arunkumar Srinivasan Sun, 07 Jul 2013 00:47:15 -0700

Hello all,

I thought it might be useful to connect a recent post on SO discussing more or 
less the same issue: 
http://stackoverflow.com/questions/17508127/na-in-i-expression-of-data-table-possible-bug
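
For anyone following the link, here is a minimal base-R sketch of the
discrepancy being discussed. It assumes, as described further down this thread,
that a ! prefix in i selects every row the unprefixed expression would not have
returned, while a plain logical i drops NA the way subset() does. It imitates
both with which() and setdiff() rather than calling data.table, so it says
nothing about what any particular data.table version actually does.

```r
# Base-R sketch only; no data.table calls are used here.
x <- c(0, NA, 5, 0, 10)

# In base R the two spellings give the same logical vector, NA included.
identical(!(x == 0), x != 0)

# subset()-style selection: which() drops NA, so only definite matches remain.
rows_neq <- which(x != 0)

# "Not-join"-style selection as described in the thread: every row that
# x == 0 would NOT have returned, so the NA row is kept.
rows_not <- setdiff(seq_along(x), which(x == 0))

rows_neq
rows_not
```

The two index vectors differ only in the row where x is NA, which is the
inconsistency between !(x == .) and x != . raised in the thread.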


Arun


On Monday, June 10, 2013 at 7:01 PM, Arunkumar Srinivasan wrote:

> Hi Matthew, 
> Thanks for clarifying this. To me the "not join" operation is very similar to 
> "setdiff" operation but for a data.frame/data.table. So DT[!J(.)] could be 
> interpreted as setdiff(DT, DT[J(.)]).
> 
> No, I'm with you that it makes sense to extend it to logical vector 
> operations as well. So far, I think everyone who wrote back also agrees 
> with the idea of:
> 
> 1) !(x == .) and x != . being identical
> 2) ~(.) (or) NJ(.) (or) -(.) being a NOT JOIN on data.table/list/vectors etc.
> 
> I'd love for these two to be on the feature list. I really don't mind the 
> "~", "NJ" or "-". 
> 
> Thanks again,
> Arun
> 
> 
> On Monday, June 10, 2013 at 5:28 PM, Matthew Dowle wrote:
> 
> >  
> > Hi Arun,
> > Indeed.  ! was introduced for not-join i.e. X[!Y] where i is type 
> > data.table.  Extending it to vectors seemed to make sense at the time; 
> > e.g., X[!"foo"] and X[!3:6] (rather than the X[-3:6] mistake where 
> > X[-(3:6)] was intended) were in my mind.   I think of everything as a join 
> > really; e.g., "where rownumber = i".
> > But I think I'm fine with ! being not-join for data.table/list i only.  Or 
> > should it be turned off for logical vector i only, leaving ! as-is for 
> > character and integer vector i?
> > Matthew
> >  
> > On 10.06.2013 15:52, Arunkumar Srinivasan wrote:
> > > Matthew, 
> > > It just occurred to me. I'd be glad if you can clarify this. The 
> > > operation is supposed to be "Not Join". Which means, I'd expect the "!" 
> > > to be used with "J" as in:
> > > dt <- data.table(x=c(0,0,1,1,3), y=1:5)
> > > setkey(dt, "x")
> > > dt[J(c(1,3))] # join
> > >    x y
> > > 1: 1 3
> > > 2: 1 4
> > > 3: 3 5
> > > 
> > > dt[!J(c(1,3))]
> > >    x y
> > > 1: 0 1
> > > 2: 0 2
> > > 
> > > Here the concept of "Not Join" with the use of "!J(.)" makes total sense. 
> > > However, extending it to not-join for logical vectors is what seems to be 
> > > an issue. It's more of a logical indexing than a join (at least in my 
> > > mind). So, if it is possible to distinguish between "!" and "!J" (by 
> > > checking if `i` is a data.table or not) to tell if it's a subsetting by 
> > > logical vector or subsetting by "data.table" and then deciding what to 
> > > do, would that resolve this issue? If not, what's the reason behind using 
> > > "!" as a not-join during logical indexing? Is it still considered as a 
> > > not-join?? 
> > > Just a thought. I hope it makes at least a little sense.
> > > Best,
> > > Arun
> > > 
> > > 
> > > On Monday, June 10, 2013 at 4:35 PM, Matthew Dowle wrote:
> > > 
> > > > Hm, another good point. We need ~ for formulae, although I can't
> > > > imagine a formula in i (only in j). But in both i and j we might want
> > > > to get(x).
> > > > I thought about ^ i.e. X[^Y] in the spirit of regular expression
> > > > syntax, but ^ doesn't parse with a RHS only. Needs to be parsable as a
> > > > prefix.
> > > > - maybe then? Consistent with - meaning in R. I don't think I
> > > > actually had a specific use in mind for - and +, to reserve them for,
> > > > but at the time it just seemed a shame to use up one of -/+ without
> > > > defining the other. If - does a not join, then, might + be more like
> > > > merge() (i.e. returning the union of the rows in x and i by join). I
> > > > think I had something like that in mind, but hadn't thought it through.
> > > > Some might say it should be a new argument e.g. notjoin=TRUE, but my
> > > > thinking there is readability, since we often have many lines in i, j
> > > > and by in that order, and if the "notjoin=TRUE" followed afterwards it
> > > > would be far away from the i argument to which it applies. If we
> > > > incorporate merge() into X[Y] using X[+Y] then it might avoid adding yet
> > > > more parameters, too.
> > > > On 10.06.2013 15:02, Gabor Grothendieck wrote:
> > > > > The problem with ~ is that it is using up a special character (of
> > > > > which there are only a few) for a case that does not occur much.
> > > > > I can think of other things that ~ might be better used for. For
> > > > > > example, perhaps ~ x could mean get(x). One aspect of data.table that
> > > > > > tends to be difficult is when you don't know the variable name ahead
> > > > > > of time and this would give a way to specify it concisely.
> > > > > > On Mon, Jun 10, 2013 at 5:21 AM, Arunkumar Srinivasan wrote:
> > > > > > > Matthew,
> > > > > > >
> > > > > > > How about ~ instead of ! ? I ruled out - previously to leave + and -
> > > > > > > available for future use. NJ() may be possible too.
> > > > > > >
> > > > > > > Both "NJ()" and "~" are okay for me.
> > > > > > >
> > > > > > > That result makes perfect sense to me. I don't think of !(x==.) being the
> > > > > > > same as x!=. ! is simply a prefix. It's all the rows that aren't returned
> > > > > > > if the ! prefix wasn't there.
> > > > > > >
> > > > > > > I understand that `DT[!(x)]` does what `data.table` is designed to do
> > > > > > > currently. What I failed to mention was that if one were to consider
> > > > > > > implementing `!(x==.)` as the same as `x != .` then this behaviour has to
> > > > > > > be changed. Let's forget this point for a moment.
> > > > > > >
> > > > > > > That needs to be fixed. But we're getting quite theoretical here and far
> > > > > > > away from common use cases. Why would we ever have row numbers of the
> > > > > > > table, as a column of the table itself and want to select the rows by
> > > > > > > number not mentioned in that column?
> > > > > > >
> > > > > > > Probably I did not choose a good example. Suppose that I've a data.table
> > > > > > > and I want to get all rows where "x == 0". Let's say:
> > > > > > >
> > > > > > > set.seed(45)
> > > > > > > DT <- data.table( x = sample(c(0,5,10,15), 10, replace=TRUE), y = sample(15))
> > > > > > > DF <- as.data.frame(DT)
> > > > > > >
> > > > > > > To get all rows where x == 0, it could be done with DT[x == 0]. But it
> > > > > > > makes sense, at least in the context of data.frames, to do equivalently,
> > > > > > >
> > > > > > > DF[!(DF$x), ] (or) DF[DF$x == 0, ]
> > > > > > >
> > > > > > > All I want to say is, I expect `DT[!(x)]` should give the same result as
> > > > > > > `DT[x == 0]` (even though I fully understand it's not the intended
> > > > > > > behaviour of data.table), as it's more intuitive and less confusing.
> > > > > > >
> > > > > > > So, changing `!` to `~` or `NJ` is one half of the issue for me. The other
> > > > > > > is to replace the actual function of `!` in all contexts. I hope I came
> > > > > > > across with what I wanted to say, better this time.
> > > > > > >
> > > > > > > Best,
> > > > > > > Arun
> > > > > > On Monday, June 10, 2013 at 10:52 AM, Matthew Dowle wrote:
> > > > > > Hi,
> > > > > > > How about ~ instead of ! ? I ruled out - previously to leave + and -
> > > > > > > available for future use. NJ() may be possible too.
> > > > > > Matthew
> > > > > > On 10.06.2013 09:35, Arunkumar Srinivasan wrote:
> > > > > > Hi Matthew,
> > > > > > > My view (from the last reply) more or less reflects mnel's comments here:
> > > > > > > http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently#comment23317096_16240143
> > > > > > > Pasted here for convenience:
> > > > > > >
> > > > > > > data.table is mimicking subset in its handling of NA values in logical i
> > > > > > > arguments. -- the only issue is the ! prefix signifying a not-join, not the
> > > > > > > way one might expect. Perhaps the not join prefix could have been NJ not !
> > > > > > > to avoid this confusion -- this might be another discussion to have on the
> > > > > > > mailing list -- (I think it is a discussion worth having)
> > > > > > >
> > > > > > Arun
> > > > > > On Monday, June 10, 2013 at 10:28 AM, Arunkumar Srinivasan wrote:
> > > > > > > Hm, good point. Is data.table consistent with SQL already, for both == and
> > > > > > > !=, and so no change needed?
> > > > > > >
> > > > > > > Yes, I believe it's already consistent with SQL. However, the current
> > > > > > > interpretation of NA (documentation) being treated as FALSE is not needed /
> > > > > > > untrue, imho (Please see below).
> > > > > > >
> > > > > > > And it was correct for Frank to be mistaken.
> > > > > > >
> > > > > > > Yes, it seems like he was mistaken.
> > > > > > >
> > > > > > > Maybe just some more documentation and examples needed then.
> > > > > > >
> > > > > > > It'd be much more appropriate if the documentation reflected that
> > > > > > > subsetting in data.table mimics the "subset" function (in order to be in
> > > > > > > line with SQL) by dropping rows where the logical evaluates to NA. In the
> > > > > > > code I pasted a couple of posts back, replacing NAs with FALSE was not
> > > > > > > necessary, as `irows <- which(i)` makes clear that `which` is being used to
> > > > > > > get the indices and then subset; this fits perfectly well with the
> > > > > > > interpretation of NA in data.table.
> > > > > > >
> > > > > > > Are you happy that DT[!(x==.)] and DT[x!=.] do treat NA inconsistently? :
> > > > > > > http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently
> > > > > > >
> > > > > > > Ha, I like the idea behind the use of () in evaluating expressions. It's
> > > > > > > another nice layer towards simplicity in data.table. But I still think
> > > > > > > equivalent logical operations should not give different results. If
> > > > > > > !(x == .) and x != . are indeed different, then I'd suggest replacing `!`
> > > > > > > with a more appropriate name, as it's much easier to get confused otherwise.
> > > > > > >
> > > > > > > In essence, either !(x == .) must evaluate to (x != .) if the underlying
> > > > > > > meaning of these is the same, or the `!` in `!(x==.)` must be replaced with
> > > > > > > something that's more appropriate for what it's supposed to be. Personally,
> > > > > > > I prefer the former. It would greatly tighten the structure and consistency.
> > > > > > >
> > > > > > > "na.rm = TRUE/FALSE" sounds good to me. I'd only considered nomatch before
> > > > > > > in the context of joins, not logical subsets.
> > > > > > >
> > > > > > > Yes, I find this option would give more control in evaluating expressions
> > > > > > > with ease in `i`, by providing both "subset" (default) and the typical
> > > > > > > data.frame subsetting (na.rm = FALSE).
> > > > > > >
> > > > > > Best regards,
> > > > > > Arun
> > > > > > _______________________________________________
> > > > > > datatable-help mailing list
> > > > > > [email protected] 
> > > > > > (mailto:[email protected])
> > > > > > https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
> > > > > > 
> > > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > 
> > > > 
> > > > 
> > > 
> > > 
> > 
> >  
> >  
> > 
> > 
> > 
> 
>

