Re: [datatable-help] Follow-up on subsetting data.table with NAs

Matthew Dowle Mon, 10 Jun 2013 07:44:31 -0700

Hm, another good point. We need ~ for formulae, although I can't imagine a formula in i (only in j). But in both i and j we might want to get(x).

I thought about ^ i.e. X[^Y] in the spirit of regular expression syntax, but ^ doesn't parse with a RHS only. Needs to be parsable as a prefix.
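The prefix constraint can be checked directly against R's parser. A minimal sketch in base R; the helper name `parses_as_prefix` is hypothetical, introduced here only for illustration, and is not part of data.table:

```r
# Does the R parser accept `op` as a prefix (unary) operator on a symbol?
# parses_as_prefix() is a hypothetical helper, not a data.table function.
parses_as_prefix <- function(op) {
  result <- try(parse(text = paste0(op, "Y")), silent = TRUE)
  !inherits(result, "try-error")
}

# Candidate prefixes mentioned in this thread
sapply(c("!", "-", "+", "~", "^"), parses_as_prefix)
```

Of these candidates only `^` fails to parse, since R defines it solely as a binary operator.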

- maybe then? Consistent with - meaning in R. I don't think I actually had a specific use in mind for - and +, to reserve them for, but at the time it just seemed a shame to use up one of -/+ without defining the other. If - does a not join, then, might + be more like merge() (i.e. returning the union of the rows in x and i by join). I think I had something like that in mind, but hadn't thought it through.

Some might say it should be a new argument e.g. notjoin=TRUE, but my thinking there is readability, since we often have many lines in i, j and by in that order, and if the "notjoin=TRUE" followed afterwards it would be far away from the i argument to which it applies. If we incorporate merge() into X[Y] using X[+Y] then it might avoid adding yet more parameters, too.
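To make the two join flavours concrete, here is a rough sketch in base R on plain data.frames. It only illustrates the semantics being discussed (a not-join versus a merge() returning the union of rows); the X[-Y] and X[+Y] spellings are proposals in this thread and are not used here. The tables X and Y and their columns are made up for the example:

```r
# Two small example tables sharing a key column `id` (made-up data)
X <- data.frame(id = c(1, 2, 3), a = c("p", "q", "r"))
Y <- data.frame(id = c(2, 3, 4), b = c("u", "v", "w"))

# Not-join: rows of X whose key does not appear in Y
not_join <- X[!(X$id %in% Y$id), ]

# Union-style merge: one row for every key found in X or in Y
union_join <- merge(X, Y, by = "id", all = TRUE)

not_join
union_join
```

The not-join keeps only id 1, while the all = TRUE merge keeps ids 1 to 4 and fills the missing side with NA.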



On 10.06.2013 15:02, Gabor Grothendieck wrote:

The problem with ~ is that it is using up a special character (of
which there are only a few) for a case that does not occur much.

I can think of other things that ~ might be better used for.  For
example, perhaps ~ x could mean get(x). One aspect of data.table that
tends to be difficult is when you don't know the variable name ahead
of time and this would give a way to specify it concisely.
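The get(x) idiom referred to here can be sketched in base R as follows. This uses with() on a data.frame rather than data.table's j, and the proposed `~ x` shorthand is not implemented anywhere in it; the table and the variable `col` are made up for the example:

```r
DF <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# The column name is only known at run time
col <- "b"

# get() resolves the string to the column of that name inside DF
values <- with(DF, get(col))
values
```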

On Mon, Jun 10, 2013 at 5:21 AM, Arunkumar Srinivasan
<[email protected]> wrote:
Matthew,
How about ~ instead of ! ? I ruled out - previously to leave + and -
available for future use.  NJ() may be possible too.

Both "NJ()" and "~" are okay for me.
That result makes perfect sense to me. I don't think of !(x==.) being the same as x!=. ! is simply a prefix. It's all the rows that aren't
returned if the ! prefix wasn't there.
I understand that `DT[!(x)]` does what `data.table` is designed to do
currently. What I failed to mention was that if one were to consider
implementing `!(x==.)` as the same as `x != .` then this behaviour has to be
changed. Let's forget this point for a moment.
That needs to be fixed. But we're getting quite theoretical here and far away from common use cases. Why would we ever have row numbers of the table, as a column of the table itself and want to select the rows by number
not mentioned in that column?
Probably I did not choose a good example. Suppose that I've a data.table and
I want to get all rows where "x == 0". Let's say:

set.seed(45)
DT <- data.table( x = sample(c(0,5,10,15), 10, replace=TRUE), y =
sample(15))

DF <- as.data.frame(DT)
To get all rows where x == 0, it could be done with DT[x == 0]. But it makes
sense, at least in the context of data.frames, to do equivalently,

DF[!(DF$x), ] (or) DF[DF$x == 0, ]
All I want to say is, I expect `DT[!(x)]` should give the same result as `DT[x == 0]` (even though I fully understand it's not the intended behaviour
of data.table), as it's more intuitive and less confusing.
So, changing `!` to `~` or `NJ` is one half of the issue for me. The other is to restore the usual function of `!` in all contexts. I hope I came
across with what I wanted to say, better this time.
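The data.frame equivalence claimed above can be checked in base R. A small sketch with fixed values instead of sample(), so the result does not depend on the random seed:

```r
# Fixed example data (made up; not the sampled DT/DF from the message)
DF <- data.frame(x = c(0, 5, 0, 10, 15), y = 1:5)

# !(DF$x) coerces x to logical first, so zeros become TRUE after negation
by_negation <- DF[!(DF$x), ]
by_equality <- DF[DF$x == 0, ]

identical(by_negation, by_equality)
```

Both forms select rows 1 and 3 here, which is the behaviour expected from `DT[!(x)]` in the argument above.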

Best,

Arun


On Monday, June 10, 2013 at 10:52 AM, Matthew Dowle wrote:



Hi,
How about ~ instead of ! ? I ruled out - previously to leave + and -
available for future use.  NJ() may be possible too.

Matthew



On 10.06.2013 09:35, Arunkumar Srinivasan wrote:

Hi Matthew,
My view (from the last reply) more or less reflects mnel's comments here:
http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently#comment23317096_16240143
Pasted here for convenience:
data.table is mimicking subset in its handling of NA values in logical i arguments. -- the only issue is the ! prefix signifying a not-join, not the way one might expect. Perhaps the not join prefix could have been NJ not ! to avoid this confusion -- this might be another discussion to have on the
mailing list -- (I think it is a discussion worth having)

Arun

On Monday, June 10, 2013 at 10:28 AM, Arunkumar Srinivasan wrote:
Hm, good point. Is data.table consistent with SQL already, for both == and
!=, and so no change needed?
Yes, I believe it's already consistent with SQL. However, the current interpretation of NA (documentation) being treated as FALSE is not needed /
untrue, imho (Please see below).


And it was correct for Frank to be mistaken.

Yes, it seems like he was mistaken.

Maybe just some more documentation and examples needed then.
It'd be much more appropriate if the documentation reflected the role of subsetting in data.table as mimicking the "subset" function (in order to be in line with SQL) by dropping NA evaluated logicals. In the code I pasted a couple of posts before, the lines where NAs are replaced with FALSE are not necessary, as `irows <- which(i)` makes clear that `which` is being used to get the indices and then subset. This fits perfectly well with the interpretation of NA in
data.table.
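The difference between dropping NA rows (as subset() and which() do) and ordinary data.frame indexing can be seen in base R. A minimal sketch on made-up data; it does not call data.table itself:

```r
DF <- data.frame(x = c(1, NA, 2), y = c("a", "b", "c"))

i <- DF$x == 1                    # TRUE, NA, FALSE

plain_rows  <- DF[i, ]            # NA in i yields an extra all-NA row
which_rows  <- DF[which(i), ]     # which() silently drops the NA
subset_rows <- subset(DF, x == 1) # subset() also drops the NA

nrow(plain_rows)
nrow(which_rows)
nrow(subset_rows)
```

Because which() already discards NA, a separate step that turns NA into FALSE before calling it would not change the selected rows.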
Are you happy that DT[!(x==.)] and DT[x!=.] do treat NA inconsistently? :
http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently
Ha, I like the idea behind the use of () in evaluating expressions. It's another nice layer towards simplicity in data.table. But I still think equivalent logical operations should not give different results. If !(x== .) and x != . are indeed different, then I'd suggest replacing `!` with a more appropriate name, as it's much too easy to
get confused otherwise.
In essence, either !(x == .) must evaluate to (x != .) if the underlying meaning of these is the same, or the `!` in `!(x==.)` must be replaced with something that's more appropriate for what it's supposed to be. Personally, I prefer the former. It would greatly tighten the structure and consistency.
"na.rm = TRUE/FALSE" sounds good to me. I'd only considered nomatch before
in the context of joins, not logical subsets.
Yes, I find this option would give more control in evaluating expressions with ease in `i`, by providing both "subset" (default) and the typical
data.frame subsetting (na.rm = FALSE).
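The inconsistency under discussion can be emulated with plain index vectors. This is only a sketch of the semantics as described in this thread (the ! prefix returning all rows that the un-prefixed expression would not return); it does not call data.table, and actual behaviour may differ between versions:

```r
x <- c(1, NA, 2)

# Rows that DT[x == 1] would return: which() drops the NA
selected <- which(x == 1)

# ! as a not-join prefix: every row that was NOT returned above
not_join_rows <- setdiff(seq_along(x), selected)

# ! as ordinary logical negation, i.e. x != 1, again via which()
logical_rows <- which(x != 1)

not_join_rows
logical_rows
```

Under these assumptions the NA row (row 2) is kept by the not-join reading but dropped by the logical reading, which is the discrepancy between `DT[!(x==.)]` and `DT[x!=.]`.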
Best regards,

Arun







_______________________________________________
datatable-help mailing list
[email protected]

https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

