Re: [Bioc-devel] deprecation of keepSeqlevels and renameSeqlevels

Michael Lawrence Tue, 21 May 2013 08:01:33 -0700

On Mon, May 20, 2013 at 10:36 PM, Hervé Pagès <hpa...@fhcrc.org> wrote:


> Michael,
>
>
> On 05/20/2013 08:44 PM, Michael Lawrence wrote:
>
>>
>>
>>
>> On Mon, May 20, 2013 at 4:13 PM, Hervé Pagès <hpa...@fhcrc.org
>> <mailto:hpa...@fhcrc.org>> wrote:
>>
>>     Hi,
>>
>>
>>     On 05/20/2013 03:15 PM, Michael Lawrence wrote:
>>
>>         On Mon, May 20, 2013 at 3:11 PM, Martin Morgan
>>         <mtmor...@fhcrc.org <mailto:mtmor...@fhcrc.org>> wrote:
>>
>>             Hi Michael --
>>
>>
>>             On 5/20/2013 5:56 AM, Michael Lawrence wrote:
>>
>>                 While it's cool that seqlevels<- has become so flexible,
>>                 I still claim
>>                 that
>>                 rename/keep/drop would be a lot easier to read, because
>>                 they describe the
>>                 high-level operation, and there's no reason to deprecate
>>                 them. They're
>>                 also
>>                 easier to remember. For example, for dropping with
>>                 seqlevels<-, the user
>>                 needs
>>                 to remember that "force" is necessary to drop. Much
>>                 easier to just say
>>                 "dropSeqlevels(), please". And reimplementing
>>                 keepSeqlevels is still too
>>                 complicated. Is it such a maintenance burden to have
>>                 these simple
>>                 wrappers sit
>>                 on top of the low-level manipulators?
>>
>>                 Another suggestion: renameSeqlevels should not require a
>>                 named vector (in
>>                 cases
>>                 where it is unnamed, require that it is parallel to the
>>                 existing
>>                 seqlevels, as
>>                 with seqlevels<-).
>>
>>
>>             I didn't really indicate what drove my desire to see
>>             keepSeqlevels
>>             deprecated. keepSeqlevels, seqlevels<-, and isActiveSeq were
>>             developed more
>>             or less independently, and have different contracts. The
>>             different
>>             contracts are confusing to the user, as is creating a usable
>>             help system
>>             when there are multiple end points for what is a common
>>             operation. The help
>>             pages of each were inadequate in different ways. And because
>>             the code was
>>             developed independently, support for different objects was
>>             inconsistent. So
>>             actually, yes, the maintenance (and use) burden for the
>>             previous state of
>>             affairs was substantial.
>>
>>             On the other hand, I agree that keepSeqlevels is convenient
>>             as a simple
>>             wrapper around seqlevels<-, in the same way that setNames
>>             and names<- are
>>             both useful.
>>
>>             So we could iterate to
>>
>>                 keepSeqlevels <-
>>                     function(x, value, ...)
>>                 {
>>                     seqlevels(x, force=TRUE) <- value
>>                     x
>>                 }
>>
>>             but I'd be less enthusiastic about maintaining the original
>>             contract of
>>             keepSeqlevels, where keepSeqlevels(gr1, gr2), would have
>>             worked for two
>>             GRanges objects.
>>
>>
>>     Why would this be called keepSeqlevels() if, depending on how it's
>> used,
>>     it will either add, drop, rename, or permute the seqlevels?
>>     Couldn't this be called setSeqlevels?
>>
>>
>> I thought that new2old was necessary for permuting.
>>
>
> The seqlevels setter has no 'new2old' arg.
>
>
>  You're right that
>> adding should be disallowed for keepSeqlevels(). Adding seqlevels is not
>> a common operation. The two common operations are:
>> * Permuting, usually because the data were imported in non-canonical
>> order (seqnameStyle was designed to address this, no?).
>>
>
> Permuting is straightforward with seqlevels<-:
>
>   > seqlevels(gr)
>   [1] "chr1" "chr2" "chr3"
>
>   > seqlevels(gr) <- rev(seqlevels(gr))
>
>   > seqlevels(gr)
>   [1] "chr3" "chr2" "chr1"
>
>
>  * Subsetting, either by keeping or dropping.
>>
>
> Also straightforward:
>
>   > seqlevels(gr, force=TRUE) <- "chr2"
>
>   > seqlevels(gr)
>   [1] "chr2"
>
>
>  Two main reasons: taking a
>> small slice of the data for prototyping, or removing problematic
>> chromosomes (sex, circular). This goes back to bsapply and the exclude
>> argument.
>>
>
> There are other important use cases:
>
>   - Dropping seqlevels that are *not* in use. This happens for example
>     when subsetting a BAM file with Rsamtools::filterBam to keep only
>     the alignments located on a given chromosome. The entire sequence
>     dictionary (in the header) is propagated to the resulting BAM file
>     so when you read the file with readGAlignments() you end up with a
>     bunch of useless seqlevels that you generally start by dropping.
>     You don't want to drop any alignment so force=FALSE is your friend.
>
>
This is sort of tangential to the discussion, but do you really want to do
this? I would preserve the universe as given by the BAM.


>   - An even more common operation is renaming: 90% of the times that's
>     what I use seqlevels<- for, and that's what I tell people to use
>     when they need to rename their chromosomes. Also straightforward:
>
>       > seqlevels(gr)
>       [1] "chr1" "chr2" "chr3" "chrM"
>
>       > seqlevels(gr) <- sub("^chr", "", seqlevels(gr))
>       > seqlevels(gr)
>       [1] "1" "2" "3" "M"
>
>       > seqlevels(gr)[seqlevels(gr) == "M"] <- "MT"
>       > seqlevels(gr)
>       [1] "1"  "2"  "3"  "MT"
>
>     Note that this is simpler to use than renameSeqlevels() which
>     always required you to build the entire right value as a named
>     vector.
>
>
Sure, renaming is a common use case. Not sure how I forgot that.

As I wrote earlier in the thread, renameSeqlevels should be changed so as
not to require naming of the vector.


> So the added-value of keepSeqlevels() seems to boil down to:
>
>   (a) it always uses force=TRUE
>
>   (b) it preserves the original object and returns the modified object
>
> If you want to restrict the use of keepSeqlevels() to permuting and
> dropping, you'll also have to disallow renaming (in addition to adding).
> Then its name will more or less reflect what it does (if "keep" means
> "permute" or "drop"). The final result will be something that does less
> than setSeqlevels() and that is not necessarily easier to read: both
> will set on the object whatever vector of seqlevels they are supplied,
> except that one will fail when doing so would actually result in adding
> or renaming seqlevels.
>
>
So it looks like seqlevels<- is pretty powerful. Now I see that if I scroll
down to the examples in the man page, I find the actual documentation. It's
cool to learn that we can replace seqlevels on TxDb objects. My argument
has always been that since these low-level operations are so powerful, it's
nice to have high-level operations to clarify the code. We have to think
hard to know what the RHS will do in that replacement. It's probably one of
the most complex replacement operations; far more complex than the typical
one, including levels<- on factor. There's nothing wrong with that; it's
just complexity that we should be able to hide.

Michael

H.
>
>
>> Michael
>>
>>
>>     H.
>>
>>
>>
>>         Well, I wasn't even aware of that feature, so you've made your
>>         point about
>>         the documentation ;) Sounds like a good compromise to me.
>>
>>         Thanks for understanding,
>>         Michael
>>
>>
>>
>>             Martin
>>
>>
>>                 Michael
>>
>>
>>
>>
>>
>>
>>                 On Sat, May 18, 2013 at 6:00 PM, Martin Morgan
>>                 <mtmor...@fhcrc.org <mailto:mtmor...@fhcrc.org>
>>                 <mailto:mtmor...@fhcrc.org <mailto:mtmor...@fhcrc.org>>>
>>
>>                 wrote:
>>
>>                       On 05/18/2013 05:42 PM, Martin Morgan wrote:
>>
>>                           Some of the most common operations are
>>                 straight-forward, e.g.,
>>
>>                                 > gr = GRanges(paste0("chr", c(1:22,
>>                 "X", "Y")), IRanges(1,
>>                 100))
>>                                 > seqlevels(gr) = sub("chr", "ch",
>>                 seqlevels(gr))
>>
>>
>>                       To get a more comprehensive example I should have
>>                 followed my own
>>                 advice and
>>                       grabbed from the help page!
>>
>>                             ## Rename:
>>                             seqlevels(txdb) <- sub("chr", "",
>>                 seqlevels(txdb))
>>                             seqlevels(txdb)
>>
>>                             seqlevels(txdb) <- paste0("CH",
>> seqlevels(txdb))
>>                             seqlevels(txdb)
>>
>>                             seqlevels(txdb)[seqlevels(__****__txdb) ==
>>
>>                 "CHM"] <- "M"
>>
>>
>>                             seqlevels(txdb)
>>
>>
>>                       --
>>                       Computational Biology / Fred Hutchinson Cancer
>>                 Research Center
>>                       1100 Fairview Ave. N.
>>                       PO Box 19024 Seattle, WA 98109
>>
>>                       Location: Arnold Building M1 B861
>>                       Phone: (206) 667-2793 <tel:%28206%29%20667-2793>
>>                 <tel:%28206%29%20667-2793>
>>
>>
>>
>>
>>             --
>>             Dr. Martin Morgan, PhD
>>
>>             Fred Hutchinson Cancer Research Center
>>             1100 Fairview Ave. N.
>>             PO Box 19024 Seattle, WA 98109
>>
>>
>>                  [[alternative HTML version deleted]]
>>
>>         ______________________________**___________________
>>         Bioc-devel@r-project.org 
>> <mailto:Bioc-devel@r-project.**org<Bioc-devel@r-project.org>
>> >
>>         mailing list
>>         
>> https://stat.ethz.ch/mailman/_**_listinfo/bioc-devel<https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
>>
>>         
>> <https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>> >
>>
>>
>>     --
>>     Hervé Pagès
>>
>>     Program in Computational Biology
>>     Division of Public Health Sciences
>>
>>     Fred Hutchinson Cancer Research Center
>>     1100 Fairview Ave. N, M1-B514
>>     P.O. Box 19024
>>     Seattle, WA 98109-1024
>>
>>     E-mail: hpa...@fhcrc.org <mailto:hpa...@fhcrc.org>
>>     Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>>     Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>>
>>
>>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpa...@fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] deprecation of keepSeqlevels and renameSeqlevels

Reply via email to