Hm, another good point. We need ~ for formulae, although I can't imagine a formula in i (only in j). But in both i and j we might want to get(x).

I thought about ^ i.e. X[^Y] in the spirit of regular expression syntax, but ^ doesn't parse with a RHS only. Needs to be parsable as a prefix.

Maybe - then? That would be consistent with the meaning of - in R. I don't think I actually had a specific use in mind when reserving - and +, but at the time it just seemed a shame to use up one of -/+ without defining the other. If - does a not-join, might + then be more like merge() (i.e. returning the union of the rows in x and i by join)? I think I had something like that in mind, but hadn't thought it through.

Some might say it should be a new argument, e.g. notjoin=TRUE, but my concern there is readability: we often have many lines in i, j and by, in that order, and a "notjoin=TRUE" following afterwards would be far away from the i argument to which it applies. If we incorporate merge() into X[Y] using X[+Y] then it might avoid adding yet more parameters, too.
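
To make the two operations concrete, here is a minimal sketch using the existing ! not-join prefix and merge(). The tables X and Y and their columns are made up for illustration; X[-Y] and X[+Y] are only proposals in this thread and are not used here.

```r
library(data.table)

# Hypothetical example tables, both keyed on id
X <- data.table(id = c(1L, 2L, 3L), a = c("p", "q", "r"), key = "id")
Y <- data.table(id = c(2L, 4L), b = c("u", "v"), key = "id")

# Not-join: rows of X whose key has no match in Y (ids 1 and 3)
not_joined <- X[!Y]

# merge() with all = TRUE: union of the rows of X and Y by key (ids 1 to 4)
merged <- merge(X, Y, all = TRUE)

print(not_joined)
print(merged)
```

Keeping the prefix next to Y, as in X[!Y], is the readability argument made above: the modifier sits directly on the i argument it applies to.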


On 10.06.2013 15:02, Gabor Grothendieck wrote:
The problem with ~ is that it is using up a special character (of
which there are only a few) for a case that does not occur much.

I can think of other things that ~ might be better used for.  For
example, perhaps ~ x could mean get(x). One aspect of data.table that
tends to be difficult is when you don't know the variable name ahead
of time and this would give a way to specify it concisely.
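
As an illustration of the use case, here is how a column whose name is only known at run time can already be reached with get() inside j. The table and the variable col are hypothetical; the ~ x shorthand is just the suggestion made above and does not exist.

```r
library(data.table)

DT <- data.table(a = 1:3, b = c(10, 20, 30))  # hypothetical data
col <- "b"                                     # column name known only at run time

# get() looks the name up among the columns of DT when used inside j
total <- DT[, sum(get(col))]
print(total)
```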

On Mon, Jun 10, 2013 at 5:21 AM, Arunkumar Srinivasan
<[email protected]> wrote:
Matthew,

How about ~ instead of ! ? I ruled out - previously to leave + and -
available for future use.  NJ() may be possible too.

Both "NJ()" and "~" are okay for me.

That result makes perfect sense to me. I don't think of !(x==.) being the same as x!=. ! is simply a prefix. It's all the rows that aren't
returned if the ! prefix wasn't there.

I understand that `DT[!(x)]` does what `data.table` is designed to do
currently. What I failed to mention was that if one were to consider
implementing `!(x==.)` as the same as `x != .` then this behaviour has to be
changed. Let's forget this point for a moment.

That needs to be fixed. But we're getting quite theoretical here and far away from common use cases. Why would we ever have row numbers of the table, as a column of the table itself and want to select the rows by number
not mentioned in that column?

Probably I did not choose a good example. Suppose that I've a data.table and
I want to get all rows where "x == 0". Let's say:

set.seed(45)
DT <- data.table(x = sample(c(0,5,10,15), 10, replace=TRUE), y = sample(15))

DF <- as.data.frame(DT)

To get all rows where x == 0, it could be done with DT[x == 0]. But it makes
sense, at least in the context of data.frames, to do equivalently,

DF[!(DF$x), ] (or) DF[DF$x == 0, ]

All I want to say is, I expect `DT[!(x)]` should give the same result as `DT[x == 0]` (even though I fully understand it's not the intended behaviour
of data.table), as it's more intuitive and less confusing.
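
The data.frame equivalence described above can be checked with a small base R sketch. It assumes y is drawn with ten values so that both columns have the same length, which data.frame() requires.

```r
set.seed(45)
# Assumption: y has 10 values so that both columns have equal length
DF <- data.frame(x = sample(c(0, 5, 10, 15), 10, replace = TRUE),
                 y = sample(15, 10))

# !(DF$x) coerces x to logical first, so it is TRUE exactly where x == 0
by_negation <- DF[!(DF$x), ]
by_equality <- DF[DF$x == 0, ]

# Both forms select the same rows of the data.frame
print(identical(by_negation, by_equality))
```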

So, changing `!` to `~` or `NJ` is one half of the issue for me. The other is to replace the actual function of `!` in all contexts. I hope I've explained
what I wanted to say better this time.

Best,

Arun


On Monday, June 10, 2013 at 10:52 AM, Matthew Dowle wrote:



Hi,

How about ~ instead of ! ? I ruled out - previously to leave + and -
available for future use.  NJ() may be possible too.

Matthew



On 10.06.2013 09:35, Arunkumar Srinivasan wrote:

Hi Matthew,
My view (from the last reply) more or less reflects mnel's comments here:

http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently#comment23317096_16240143
Pasted here for convenience:
data.table is mimicking subset in its handling of NA values in logical i arguments. -- the only issue is the ! prefix signifying a not-join, not the way one might expect. Perhaps the not join prefix could have been NJ not ! to avoid this confusion -- this might be another discussion to have on the
mailing list -- (I think it is a discussion worth having)

Arun

On Monday, June 10, 2013 at 10:28 AM, Arunkumar Srinivasan wrote:

Hm, good point. Is data.table consistent with SQL already, for both == and
!=, and so no change needed?

Yes, I believe it's already consistent with SQL. However, the current interpretation of NA (documentation) being treated as FALSE is not needed /
untrue, imho (Please see below).


And it was correct for Frank to be mistaken.

Yes, it seems like he was mistaken.

Maybe just some more documentation and examples needed then.

It'd be much more appropriate if the documentation said that subsetting in data.table mimics the "subset" function (in order to be in line with SQL) by dropping rows where the logical expression evaluates to NA. The code I pasted a couple of posts ago, where NAs are replaced with FALSE, was not necessary: `irows <- which(i)` makes clear that `which` is used to get the indices before subsetting, and that fits perfectly well with the interpretation of NA in
data.table.
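
A short base R sketch of the NA behaviour being discussed: which() and subset() drop rows where the condition is NA, whereas plain data.frame indexing keeps them as all-NA rows. The data.frame is made up for illustration and this is not data.table's internal code.

```r
DF <- data.frame(x = c(1, NA, 3), y = c("a", "b", "c"))  # hypothetical data

i <- DF$x == 1                    # TRUE, NA, FALSE

# which() keeps only the positions that are TRUE, so the NA is dropped
rows_which <- DF[which(i), ]

# subset() treats NA in the condition the same way
rows_subset <- subset(DF, x == 1)

# Plain logical indexing keeps the NA position as a row of NAs
rows_plain <- DF[i, ]

print(nrow(rows_which))
print(nrow(rows_subset))
print(nrow(rows_plain))
```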

Are you happy that DT[!(x==.)] and DT[x!=.] do treat NA inconsistently? :


http://stackoverflow.com/questions/16239153/dtx-and-dtx-treat-na-in-x-inconsistently

Ha, I like the idea behind the use of () in evaluating expressions. It's another nice layer towards simplicity in data.table. But I still think equivalent logical operations should not give different results. If !(x == .) and x != . are indeed different, then I'd suggest replacing `!` with a more appropriate name, as it's much too easy to
get confused otherwise.
In essence, either !(x == .) must evaluate to (x != .) if the underlying meaning of these is the same, or the `!` in `!(x==.)` must be replaced with something that's more appropriate for what it's supposed to be. Personally, I prefer the former. It would greatly tighten the structure and consistency.
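
One way to see why the two forms can differ is to compare plain logical negation with a "rows not returned" complement. The following base R sketch illustrates that reasoning on a made-up vector; it is not a description of how data.table is implemented.

```r
x <- c(1, NA, 3, 1)  # hypothetical column with one NA

# As pure logical expressions the two forms agree, NA included
same_logical <- identical(!(x == 1), x != 1)

# Logical subset semantics: NA is dropped, so only position 3 is selected
rows_neq <- which(x != 1)

# "Not" as the complement of the rows that x == 1 returns: the NA position is kept
rows_complement <- setdiff(seq_along(x), which(x == 1))

print(same_logical)
print(rows_neq)
print(rows_complement)
```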

"na.rm = TRUE/FALSE" sounds good to me. I'd only considered nomatch before
in the context of joins, not logical subsets.

Yes, I find this option would give more control in evaluating expressions with ease in `i`, by providing both "subset" (default) and the typical
data.frame subsetting (na.rm = FALSE).
Best regards,

Arun







_______________________________________________
datatable-help mailing list
[email protected]

https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
