Re: [Rd] [R] Why does R replace all row values with NAs

Hervé Pagès Tue, 03 Mar 2015 14:31:29 -0800


On 03/03/2015 02:17 PM, Gabriel Becker wrote:

Stephanie,

Actually, it's as.logical that isn't preserving matrix dimensions,
because it coerces to a logical vector:

 > x <- matrix(sample(c(NA_integer_, 1:100), 500, replace=TRUE), nrow=50)
 > dim(as.logical(x))


It's true, as.logical() doesn't help here but Stephanie is right, %in%
does not preserve the dimensions either:

> dim(x %in% 1:5)
NULL

That's because match() itself doesn't preserve the dimensions:

> dim(match(x, 1:5))
NULL

So maybe my fast is.true() should be:

is.true <- function(x)
{
  ans <- as.logical(x) %in% TRUE
  if (is.null(dim(x))) {
    names(ans) <- names(x)
  } else {
    dim(ans) <- dim(x)
    dimnames(ans) <- dimnames(x)
  }
  ans
}

or something like that...

H.

NULL

~G

On Tue, Mar 3, 2015 at 2:09 PM, Stephanie M. Gogarten
<sdmor...@u.washington.edu <mailto:sdmor...@u.washington.edu>> wrote:



    On 3/3/15 1:26 PM, Hervé Pagès wrote:



        On 03/03/2015 02:28 AM, Martin Maechler wrote:

            Diverted from R-help :
            .... as it gets into musing about new R language "primitives"

                                William Dunlap <wdun...@tibco.com
                                <mailto:wdun...@tibco.com>>
                                      on Fri, 27 Feb 2015 08:04:36 -0800
                                writes:


                  > You could define functions like

                  > is.true <- function(x) !is.na <http://is.na>(x) & x
                  > is.false <- function(x) !is.na <http://is.na>(x) & !x

                  > and use them in your selections.  E.g.,
                  >> x <-
            data.frame(a=1:10,b=2:11,c=c(__1,NA,3,NA,5,NA,7,NA,NA,10))
                  >> x[is.true(x$c >= 6), ]
                  > a  b  c
                  > 7   7  8  7
                  > 10 10 11 10

                  > Bill Dunlap
                  > TIBCO Software
                  > wdunlap tibco.com <http://tibco.com>

            Yes; the Matrix package has had these

            is0  <- function(x) !is.na <http://is.na>(x) & x == 0
            isN0 <- function(x) is.na <http://is.na>(x) | x != 0
            is1  <- function(x) !is.na <http://is.na>(x) & x   # also ==
            "isTRUE componentwise"


        Note that using %in% to block propagation of NAs is about 2x faster:

          > x <- sample(c(NA_integer_, 1:10000), 500000, replace=TRUE)
          > microbenchmark(as.logical(x) %in% TRUE, !is.na
        <http://is.na>(x) & x)
        Unit: milliseconds
                              expr       min        lq      mean
          median        uq
           as.logical(x) %in% TRUE  6.034744  6.264382  6.999083
        6.29488  6.346028
                     !is.na <http://is.na>(x) & x 11.202808 11.402437
        11.469101 11.44848 11.517576
                max neval
        40.36472 100 <tel:40.36472%20%20%20100>
           11.90916   100


    Unfortunately %in% does not preserve matrix dimensions:

     > x <- matrix(sample(c(NA_integer_, 1:100), 500, replace=TRUE),
    nrow=50)
     > dim(x)
    [1] 50 10
     > dim(!is.na <http://is.na>(x) & x)
    [1] 50 10
     > dim(as.logical(x) %in% TRUE)
    NULL

    Stephanie






            namespace hidden for a while  [note the comment of the last
            one!]
            and using them for readibility in its own code.

            Maybe we should (again) consider providing some versions of
            these with R ?

            The Matrix package also has had fast

            allFalse <- all0 <- function(x) .Call(R_all0, x)
            anyFalse <- any0 <- function(x) .Call(R_any0, x)
            ##
            ## anyFalse <- function(x) isTRUE(any(!x))         ## ~= any0
            ## any0 <- function(x) isTRUE(any(x == 0))          ## ~=
            anyFalse

            namespace hidden as well, already, which probably could also be
            brought to base R.

            One big reason to *not* go there (to internal C code) at all
            with R is
            that
            S3 and S4 dispatch for '==' ('!=', etc, the 'Compare' group
            generics)
            and 'is.na <http://is.na>() have been known and package
            writers have
            programmed methods for these.
            To ensure that S3 and S4 dispatch works "correctly" also inside
            such new internals is much less easily achieved, and so
            such a C-based internal function  is0() would no longer be
            equivalent with    !is.na <http://is.na>(x) & x == 0
            as soon as 'x' is an "object" with a '==', 'Compare' and/or
            an is.na <http://is.na>()
            method.


        Excellent point. Thank you! It really makes a big difference for
        developers who maintain a complex hierarchy of S4 classes and
        methods,
        when functions like is.true, anyFalse, etc..., which can be
        expressed in
        terms of more basic operations like ==, !=, !, is.na
        <http://is.na>, etc..., just work
        out-of-the-box on objects for which these basic operations are
        defined.

        There is conceptually a small set of "building blocks", at least for
        objects with a vector-like or list-like semantic, that can be used
        to formally describe the semantic of many functions in base R. This
        is what the man page for anyNA does by saying:

            anyNA implements any(is.na <http://is.na>(x))

        even though the actual implementation differs, but that's ok, as
        long
        as anyNA is equivalent to doing any(is.na <http://is.na>(x)) on
        any object for which
        building block is.na <http://is.na>() is implemented.

        Unfortunately there is no clearly identified set of building blocks
        in base R. For example, if I want the comparison operations to work
        on my object, I need to implement ==, >, <, !=, <=, and >= (the
        'Compare' group generics) even though it should be enough to
        implement
        == and >=, because all the others can be described in terms of these
        2 building blocks. unique/duplicated is another example
        (unique(x) is
        conceptually x[!duplicated(x)]). And so on...

        Cheers,
        H.


            OTOH, simple R versions such as your  'is.true',  called 'is1'
            inside Matrix maybe optimizable a bit by the byte compiler (and
            jit and other such tricks) and still keep the full
            semantic including correct method dispatch.

            Martin Maechler, ETH Zurich


                  > On Fri, Feb 27, 2015 at 7:27 AM, Dimitri Liakhovitski <
                  > dimitri.liakhovit...@gmail.com
            <mailto:dimitri.liakhovit...@gmail.com>__> wrote:

                  >> Thank you very much, Duncan.
                  >> All this being said:
                  >>
                  >> What would you say is the most elegant and most
            safe way to
            solve such
                  >> a seemingly simple task?
                  >>
                  >> Thank you!
                  >>
                  >> On Fri, Feb 27, 2015 at 10:02 AM, Duncan Murdoch
                  >> <murdoch.dun...@gmail.com
            <mailto:murdoch.dun...@gmail.com>> wrote:
                  >> > On 27/02/2015 9:49 AM, Dimitri Liakhovitski wrote:
                  >> >> So, Duncan, do I understand you correctly:
                  >> >>
                  >> >> When I use x$x<6, R doesn't know if it's TRUE or
            FALSE, so
            it returns
                  >> >> a logical value of NA.
                  >> >
                  >> > Yes, when x$x is NA.  (Though I think you meant x$c.)
                  >> >
                  >> >> When this logical value is applied to a row, the
            R says:
            hell, I don't
                  >> >> know if I should keep it or not, so, just in
            case, I am
            going to keep
                  >> >> it, but I'll replace all the values in this row
            with NAs?
                  >> >
                  >> > Yes.  Indexing with a logical NA is probably a
            mistake, and
            this is one
                  >> > way to signal it without actually triggering a
            warning or
            error.
                  >> >
                  >> > BTW, I should have mentioned that the example
            where you
            indexed using
                  >> > -which(x$c>=6) is a bad idea:  if none of the
            entries were 6
            or more,
                  >> > this would be indexing with an empty vector, and
            you'd get
            nothing, not
                  >> > everything.
                  >> >
                  >> > Duncan Murdoch
                  >> >
                  >> >
                  >> >>
                  >> >> On Fri, Feb 27, 2015 at 9:13 AM, Duncan Murdoch
                  >> >> <murdoch.dun...@gmail.com
            <mailto:murdoch.dun...@gmail.com>> wrote:
                  >> >>> On 27/02/2015 9:04 AM, Dimitri Liakhovitski wrote:
                  >> >>>> I know how to get the output I need, but I
            would benefit
            from an
                  >> >>>> explanation why R behaves the way it does.
                  >> >>>>
                  >> >>>> # I have a data frame x:
                  >> >>>> x =
            data.frame(a=1:10,b=2:11,c=c(__1,NA,3,NA,5,NA,7,NA,NA,10))
                  >> >>>> x
                  >> >>>> # I want to toss rows in x that contain values
             >=6. But I
            don't want
                  >> >>>> to toss my NAs there.
                  >> >>>>
                  >> >>>> subset(x,c<6) # Works correctly, but removes
            NAs in c,
            understand why
                  >> >>>> x[which(x$c<6),] # Works correctly, but
            removes NAs in c,
            understand
                  >> why
                  >> >>>> x[-which(x$c>=6),] # output I need
                  >> >>>>
                  >> >>>> # Here is my question: why does the following line
            replace the values
                  >> >>>> of all rows that contain an NA # in x$c with NAs?
                  >> >>>>
                  >> >>>> x[x$c<6,]  # Leaves rows with c=NA, but makes
            the whole
            row an NA.
                  >> Why???
                  >> >>>> x[(x$c<6) | is.na <http://is.na>(x$c),] #
            output I need - I have to be
                  >> super-explicit
                  >> >>>>
                  >> >>>> Thank you very much!
                  >> >>>
                  >> >>> Most of your examples (except the ones using
            which()) are
            doing logical
                  >> >>> indexing.  In logical indexing, TRUE keeps a
            line, FALSE
            drops the
                  >> line,
                  >> >>> and NA returns NA.  Since "x$c < 6" is NA if
            x$c is NA,
            you get the
                  >> >>> third kind of indexing.
                  >> >>>
                  >> >>> Your last example works because in the cases
            where x$c is
            NA, it
                  >> >>> evaluates NA | TRUE, and that evaluates to
            TRUE.  In the
            cases where
                  >> x$c
                  >> >>> is not NA, you get x$c < 6 | FALSE, and that's
            the same as
            x$c < 6,
                  >> >>> which will be either TRUE or FALSE.
                  >> >>>
                  >> >>> Duncan Murdoch
                  >> >>>
                  >> >>
                  >> >>
                  >> >>
                  >> >
                  >>
                  >>
                  >>
                  >> --
                  >> Dimitri Liakhovitski
                  >>
                  >> ________________________________________________
                  >> r-h...@r-project.org <mailto:r-h...@r-project.org>
            mailing list -- To UNSUBSCRIBE and more, see
                  >> https://stat.ethz.ch/mailman/__listinfo/r-help
            <https://stat.ethz.ch/mailman/listinfo/r-help>
                  >> PLEASE do read the posting guide
                  >> http://www.R-project.org/__posting-guide.html
            <http://www.R-project.org/posting-guide.html>
                  >> and provide commented, minimal, self-contained,
            reproducible
            code.
                  >>

                  > [[alternative HTML version deleted]]

                  > ________________________________________________
                  > r-h...@r-project.org <mailto:r-h...@r-project.org>
            mailing list -- To UNSUBSCRIBE and more, see
                  > https://stat.ethz.ch/mailman/__listinfo/r-help
            <https://stat.ethz.ch/mailman/listinfo/r-help>
                  > PLEASE do read the posting guide
            http://www.R-project.org/__posting-guide.html
            <http://www.R-project.org/posting-guide.html>
                  > and provide commented, minimal, self-contained,
            reproducible code.

            ________________________________________________
            R-devel@r-project.org <mailto:R-devel@r-project.org> mailing
            list
            https://stat.ethz.ch/mailman/__listinfo/r-devel
            <https://stat.ethz.ch/mailman/listinfo/r-devel>



    ________________________________________________
    R-devel@r-project.org <mailto:R-devel@r-project.org> mailing list
    https://stat.ethz.ch/mailman/__listinfo/r-devel
    <https://stat.ethz.ch/mailman/listinfo/r-devel>




--
Gabriel Becker, PhD
Computational Biologist
Bioinformatics and Computational Biology
Genentech, Inc.


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: [Rd] [R] Why does R replace all row values with NAs

Reply via email to