Re: [Bioc-devel] deprecation of keepSeqlevels and renameSeqlevels

Hervé Pagès Tue, 21 May 2013 14:35:09 -0700


On 05/21/2013 08:00 AM, Michael Lawrence wrote:




On Mon, May 20, 2013 at 10:36 PM, Hervé Pagès <hpa...@fhcrc.org
<mailto:hpa...@fhcrc.org>> wrote:

    Michael,


    On 05/20/2013 08:44 PM, Michael Lawrence wrote:




        On Mon, May 20, 2013 at 4:13 PM, Hervé Pagès <hpa...@fhcrc.org
        <mailto:hpa...@fhcrc.org>
        <mailto:hpa...@fhcrc.org <mailto:hpa...@fhcrc.org>>> wrote:

             Hi,


             On 05/20/2013 03:15 PM, Michael Lawrence wrote:

                 On Mon, May 20, 2013 at 3:11 PM, Martin Morgan
                 <mtmor...@fhcrc.org <mailto:mtmor...@fhcrc.org>
        <mailto:mtmor...@fhcrc.org <mailto:mtmor...@fhcrc.org>>> wrote:

                     Hi Michael --


                     On 5/20/2013 5:56 AM, Michael Lawrence wrote:

                         While it's cool that seqlevels<- has become so
        flexible,
                         I still claim
                         that
                         rename/keep/drop would be a lot easier to read,
        because
                         they describe the
                         high-level operation, and there's no reason to
        deprecate
                         them. They're
                         also
                         easier to remember. For example, for dropping with
                         seqlevels<-, the user
                         needs
                         to remember that "force" is necessary to drop. Much
                         easier to just say
                         "dropSeqlevels(), please". And reimplementing
                         keepSeqlevels is still too
                         complicated. Is it such a maintenance burden to
        have
                         these simple
                         wrappers sit
                         on top of the low-level manipulators?

                         Another suggestion: renameSeqlevels should not
        require a
                         named vector (in
                         cases
                         where it is unnamed, require that it is
        parallel to the
                         existing
                         seqlevels, as
                         with seqlevels<-).


                     I didn't really indicate what drove my desire to see
                     keepSeqlevels
                     deprecated. keepSeqlevels, seqlevels<-, and
        isActiveSeq were
                     developed more
                     or less independently, and have different
        contracts. The
                     different
                     contracts are confusing to the user, as is creating
        a usable
                     help system
                     when there are multiple end points for what is a common
                     operation. The help
                     pages of each were inadequate in different ways.
        And because
                     the code was
                     developed independently, support for different
        objects was
                     inconsistent. So
                     actually, yes, the maintenance (and use) burden for the
                     previous state of
                     affairs was substantial.

                     On the other hand, I agree that keepSeqlevels is
        convenient
                     as a simple
                     wrapper around seqlevels<-, in the same way that
        setNames
                     and names<- are
                     both useful.

                     So we could iterate to

                         keepSeqlevels <-
                             function(x, value, ...)
                         {
                             seqlevels(x, force=TRUE) <- value
                             x
                         }

                     but I'd be less enthusiastic about maintaining the
        original
                     contract of
                     keepSeqlevels, where keepSeqlevels(gr1, gr2), would
        have
                     worked for two
                     GRanges objects.


             Why would this be called keepSeqlevels() if, depending on
        how it's used,
             it will either add, drop, rename, or permute the seqlevels?
             Couldn't this be called setSeqlevels?


        I thought that new2old was necessary for permuting.


    The seqlevels setter has no 'new2old' arg.


        You're right that
        adding should be disallowed for keepSeqlevels(). Adding
        seqlevels is not
        a common operation. The two common operations are:
        * Permuting, usually because the data were imported in non-canonical
        order (seqnameStyle was designed to address this, no?).


    Permuting is straightforward with seqlevels<-:

       > seqlevels(gr)
       [1] "chr1" "chr2" "chr3"

       > seqlevels(gr) <- rev(seqlevels(gr))

       > seqlevels(gr)
       [1] "chr3" "chr2" "chr1"


        * Subsetting, either by keeping or dropping.


    Also straightforward:

       > seqlevels(gr, force=TRUE) <- "chr2"

       > seqlevels(gr)
       [1] "chr2"


        Two main reasons: taking a
        small slice of the data for prototyping, or removing problematic
        chromosomes (sex, circular). This goes back to bsapply and the
        exclude
        argument.


    There are other important use cases:

       - Dropping seqlevels that are *not* in use. This happens for example
         when subsetting a BAM file with Rsamtools::filterBam to keep only
         the alignments located on a given chromosome. The entire sequence
         dictionary (in the header) is propagated to the resulting BAM file
         so when you read the file with readGAlignments() you end up with a
         bunch of useless seqlevels that you generally start by dropping.
         You don't want to drop any alignment so force=FALSE is your friend.


This is sort of tangential to the discussion, but do you really want to
do this? I would preserve the universe as given by the BAM.

       - An even more common operation is renaming: 90% of the times that's
         what I use seqlevels<- for, and that's what I tell people to use
         when they need to rename their chromosomes. Also straightforward:

           > seqlevels(gr)
           [1] "chr1" "chr2" "chr3" "chrM"

           > seqlevels(gr) <- sub("^chr", "", seqlevels(gr))
           > seqlevels(gr)
           [1] "1" "2" "3" "M"

           > seqlevels(gr)[seqlevels(gr) == "M"] <- "MT"
           > seqlevels(gr)
           [1] "1"  "2"  "3"  "MT"

         Note that this is simpler to use than renameSeqlevels() which
         always required you to build the entire right value as a named
         vector.


Sure, renaming is a common use case. Not sure how I forgot that.

As I wrote earlier in the thread, renameSeqlevels should be changed so
as not to require naming of the vector.

    So the added-value of keepSeqlevels() seems to boil down to:

       (a) it always uses force=TRUE

       (b) it preserves the original object and returns the modified object

    If you want to restrict the use of keepSeqlevels() to permuting and
    dropping, you'll also have to disallow renaming (in addition to adding).
    Then its name will more or less reflect what it does (if "keep" means
    "permute" or "drop"). The final result will be something that does less
    than setSeqlevels() and that is not necessarily easier to read: both
    will set on the object whatever vector of seqlevels they are supplied,
    except that one will fail when doing so would actually result in adding
    or renaming seqlevels.


So it looks like seqlevels<- is pretty powerful. Now I see that if I
scroll down to the examples in the man page, I find the actual
documentation. It's cool to learn that we can replace seqlevels on TxDb
objects.


Has been working on TranscriptDb objects for more than

My argument has always been that since these low-level
operations are so powerful, it's nice to have high-level operations to
clarify the code. We have to think hard to know what the RHS will do in
that replacement.


When you do something like seqlevels(gr) <- new_seqlevels, you're saying
"can you please replace the current seqlevels by those new ones".
That's all. The request/contract is simple and all the complexity is
taken care of internally by seqlevels<-. It might be that, depending on
what's currently on 'gr', this will actually result in dropping, adding,
permuting. Or *a combination of the 3*. Like here:

  > seqlevels(gr)
  [1] "chr1" "chr2" "chr3"

  > seqlevels(gr) <- c("chr4", "chr3", "chr1")
  > seqlevels(gr)
  [1] "chr4" "chr3" "chr1"

  # one seqlevel was drop, one was added, and the 2 seqlevels that were
  # preserved were reordered.

When you want to subset a vector to keep all elements with an even
index you do x[c(FALSE, TRUE)]. Is that keeping or dropping elements?
Do we need to have a subsetting operator for keeping elements, another
one for dropping elements, and another one for permuting them (like
in x[sample(length(x))])?

H.

It's probably one of the most complex replacement
operations; far more complex than the typical one, including levels<- on
factor. There's nothing wrong with that; it's just complexity that we
should be able to hide.

Michael

    H.


        Michael


             H.



                 Well, I wasn't even aware of that feature, so you've
        made your
                 point about
                 the documentation ;) Sounds like a good compromise to me.

                 Thanks for understanding,
                 Michael



                     Martin


                         Michael






                         On Sat, May 18, 2013 at 6:00 PM, Martin Morgan
                         <mtmor...@fhcrc.org <mailto:mtmor...@fhcrc.org>
        <mailto:mtmor...@fhcrc.org <mailto:mtmor...@fhcrc.org>>
                         <mailto:mtmor...@fhcrc.org
        <mailto:mtmor...@fhcrc.org> <mailto:mtmor...@fhcrc.org
        <mailto:mtmor...@fhcrc.org>>>>

                         wrote:

                               On 05/18/2013 05:42 PM, Martin Morgan wrote:

                                   Some of the most common operations are
                         straight-forward, e.g.,

                                         > gr = GRanges(paste0("chr",
        c(1:22,
                         "X", "Y")), IRanges(1,
                         100))
                                         > seqlevels(gr) = sub("chr", "ch",
                         seqlevels(gr))


                               To get a more comprehensive example I
        should have
                         followed my own
                         advice and
                               grabbed from the help page!

                                     ## Rename:
                                     seqlevels(txdb) <- sub("chr", "",
                         seqlevels(txdb))
                                     seqlevels(txdb)

                                     seqlevels(txdb) <- paste0("CH",
        seqlevels(txdb))
                                     seqlevels(txdb)


        seqlevels(txdb)[seqlevels(__**____txdb) ==

                         "CHM"] <- "M"


                                     seqlevels(txdb)


                               --
                               Computational Biology / Fred Hutchinson
        Cancer
                         Research Center
                               1100 Fairview Ave. N.
                               PO Box 19024 Seattle, WA 98109

                               Location: Arnold Building M1 B861
                               Phone: (206) 667-2793
        <tel:%28206%29%20667-2793> <tel:%28206%29%20667-2793>
                         <tel:%28206%29%20667-2793>




                     --
                     Dr. Martin Morgan, PhD

                     Fred Hutchinson Cancer Research Center
                     1100 Fairview Ave. N.
                     PO Box 19024 Seattle, WA 98109


                          [[alternative HTML version deleted]]

                 ___________________________________________________
        Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org>
        <mailto:Bioc-devel@r-project.__org
        <mailto:Bioc-devel@r-project.org>>
                 mailing list
        https://stat.ethz.ch/mailman/____listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

                 <https://stat.ethz.ch/mailman/__listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>


             --
             Hervé Pagès

             Program in Computational Biology
             Division of Public Health Sciences

             Fred Hutchinson Cancer Research Center
             1100 Fairview Ave. N, M1-B514
             P.O. Box 19024
             Seattle, WA 98109-1024

             E-mail: hpa...@fhcrc.org <mailto:hpa...@fhcrc.org>
        <mailto:hpa...@fhcrc.org <mailto:hpa...@fhcrc.org>>
             Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
        <tel:%28206%29%20667-5791>
             Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
        <tel:%28206%29%20667-1319>



    --
    Hervé Pagès

    Program in Computational Biology
    Division of Public Health Sciences
    Fred Hutchinson Cancer Research Center
    1100 Fairview Ave. N, M1-B514
    P.O. Box 19024
    Seattle, WA 98109-1024

    E-mail: hpa...@fhcrc.org <mailto:hpa...@fhcrc.org>
    Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
    Fax: (206) 667-1319 <tel:%28206%29%20667-1319>


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] deprecation of keepSeqlevels and renameSeqlevels

Reply via email to