Re: [Bioc-devel] deprecation of keepSeqlevels and renameSeqlevels

Michael Lawrence Tue, 17 Sep 2013 11:31:57 -0700

Sounds great. Could we have a more intuitive name? Like naturalOrder(). And
then the method to fix the GRanges could be restoreNaturalOrder(). Just
some suggestions...





On Tue, Sep 17, 2013 at 10:58 AM, Hervé Pagès <hpa...@fhcrc.org> wrote:

> For sorting the chromosome names in "natural" order, you can use
> the makeSeqnameIds() helper function in GenomicRanges:
>
>   seqlevels(gr) <- seqlevels(gr)[makeSeqnameIds(**seqlevels(gr))]
>
> It recognizes several naming conventions (chr-prefixed or not,
> roman, etc...) e.g.:
>
>   > makeSeqnameIds(c("Y", "1", "9", "M", "2"))
>   [1] 4 1 3 5 2
>
>   > makeSeqnameIds(c("chrY", "chrI", "chrIX", "chrM", "chrII"))
>   [1] 4 1 3 5 2
>
> I still need to improve the documentation of this function, put
> more examples, and add unit tests. It's used internally by the
> makeTranscriptDb* functions in GenomicFeatures to assign internal
> ids to the chromosomes so that all the TranscriptDb extractors
> can then return GRanges and GRangesList objects with the seqlevels
> in the "natural" order.
>
> Cheers,
> H.
>
>
>
> On 09/17/2013 10:23 AM, Kasper Daniel Hansen wrote:
>
>> Actually, what I (and I think several others) just want is
>>    fixSeqnames()
>> which sorts and fixes them to some convention.  I don't really care which
>> one, but I want to do some easy harmonization, preferably in line with
>> other bioc tools.  I know this is hard to do for all organisms, but having
>> something that deals with the major ones (human, mouse, fruit fly, yeast,
>> some monkeys) would be extremely valuable.  If this function existed, I
>> could just throw all my GRanges at it.
>>
>> Kasper
>>
>>
>> On Tue, Sep 17, 2013 at 1:14 PM, Michael Lawrence <
>> lawrence.mich...@gene.com
>>
>>> wrote:
>>>
>>
>>  When I hit this I usually just do:
>>>
>>> seqlevels(gr) <- paste0("chr", c(1:22, "X", "Y"))
>>>
>>> It would be nice if there were a function for sorting chromosome names
>>> according to a specified naming convention. Like sortSeqlevels(gr,
>>> UCSCNaming()) or the alternative replacement syntax: seqinfo(gr) <-
>>> sort(seqinfo(gr), UCSCNaming()). Although sort is only single-dispatch.
>>>
>>> Michael
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Sep 17, 2013 at 8:59 AM, Vincent Carey
>>> <st...@channing.harvard.edu>**wrote:
>>>
>>>  I am replying to this as Michael mentions that this is a powerful
>>>> operation.  Here's my
>>>> problem that I believe is related, but I do not see a straightforward
>>>> solution
>>>>
>>>>  bhmm19uni
>>>>>
>>>> GRanges with 559456 ranges and 4 metadata columns:
>>>>           seqnames                 ranges strand   |              name
>>>> score
>>>>              <Rle>              <IRanges>  <Rle>   |       <character>
>>>> <numeric>
>>>>         1     chr1 [ 67229025,  67341224]      *   | 13_Heterochrom/lo
>>>>      0
>>>>         2     chr1 [203057955, 203060154]      *   |      12_Repressed
>>>>      0
>>>>         3     chr1 [  8458427,   8486226]      *   | 13_Heterochrom/lo
>>>>      0
>>>>         4     chr1 [ 16904427,  16904826]      *   | 5_Strong_Enhancer
>>>>      0
>>>>         5     chr1 [ 25289427,  25293426]      *   |       11_Weak_Txn
>>>>      0
>>>>       ...      ...                    ...    ... ...               ...
>>>>    ...
>>>>    570766    chr22   [51100931, 51101330]      *   |   2_Weak_Promoter
>>>>      0
>>>>    570767    chr22   [51101331, 51101530]      *   |   6_Weak_Enhancer
>>>>      0
>>>>    570768    chr22   [51101531, 51101730]      *   |   2_Weak_Promoter
>>>>      0
>>>>    570769    chr22   [51101731, 51178130]      *   | 13_Heterochrom/lo
>>>>      0
>>>>    570770    chr22   [51178131, 51178330]      *   | 15_Repetitive/CNV
>>>>      0
>>>>                            thick     numcode
>>>>                        <IRanges> <character>
>>>>         1 [ 67001613,  67113812]          13
>>>>         2 [201324578, 201326777]          12
>>>>         3 [  8381014,   8408813]          13
>>>>         4 [ 16777014,  16777413]           5
>>>>         5 [ 25162014,  25166013]          11
>>>>       ...                    ...         ...
>>>>    570766   [49447797, 49448196]           2
>>>>    570767   [49448197, 49448396]           6
>>>>    570768   [49448397, 49448596]           2
>>>>    570769   [49448597, 49524996]          13
>>>>    570770   [49524997, 49525196]          15
>>>>    ---
>>>>    seqlengths:
>>>>          chr1     chr10     chr11      chr5 ...      chr2     chr21
>>>>   chr4
>>>>     249250621 135534747 135006516 180915260 ... 243199373  48129895
>>>> 191154276
>>>>
>>>> If I try to make a karyogram with ggbio, the chromosomes are in an
>>>> unnatural order.  What is the right approach to getting the seqinfo in a
>>>> natural order with respect to chromosome names and lengths?  Reassigning
>>>> seqlevels seems dangerous but I haven't experimented fully.  It must be
>>>> a
>>>> common use case but I do not see it in doc.
>>>>
>>>>
>>>>
>>>> On Tue, May 21, 2013 at 11:00 AM, Michael Lawrence <
>>>> lawrence.mich...@gene.com> wrote:
>>>>
>>>>  On Mon, May 20, 2013 at 10:36 PM, Hervé Pagès <hpa...@fhcrc.org>
>>>>> wrote:
>>>>>
>>>>>  Michael,
>>>>>>
>>>>>>
>>>>>> On 05/20/2013 08:44 PM, Michael Lawrence wrote:
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, May 20, 2013 at 4:13 PM, Hervé Pagès <hpa...@fhcrc.org
>>>>>>> <mailto:hpa...@fhcrc.org>> wrote:
>>>>>>>
>>>>>>>      Hi,
>>>>>>>
>>>>>>>
>>>>>>>      On 05/20/2013 03:15 PM, Michael Lawrence wrote:
>>>>>>>
>>>>>>>          On Mon, May 20, 2013 at 3:11 PM, Martin Morgan
>>>>>>>          <mtmor...@fhcrc.org <mailto:mtmor...@fhcrc.org>> wrote:
>>>>>>>
>>>>>>>              Hi Michael --
>>>>>>>
>>>>>>>
>>>>>>>              On 5/20/2013 5:56 AM, Michael Lawrence wrote:
>>>>>>>
>>>>>>>                  While it's cool that seqlevels<- has become so
>>>>>>>
>>>>>> flexible,
>>>>>
>>>>>>                  I still claim
>>>>>>>                  that
>>>>>>>                  rename/keep/drop would be a lot easier to read,
>>>>>>>
>>>>>> because
>>>
>>>>                  they describe the
>>>>>>>                  high-level operation, and there's no reason to
>>>>>>>
>>>>>> deprecate
>>>>>
>>>>>>                  them. They're
>>>>>>>                  also
>>>>>>>                  easier to remember. For example, for dropping with
>>>>>>>                  seqlevels<-, the user
>>>>>>>                  needs
>>>>>>>                  to remember that "force" is necessary to drop. Much
>>>>>>>                  easier to just say
>>>>>>>                  "dropSeqlevels(), please". And reimplementing
>>>>>>>                  keepSeqlevels is still too
>>>>>>>                  complicated. Is it such a maintenance burden to have
>>>>>>>                  these simple
>>>>>>>                  wrappers sit
>>>>>>>                  on top of the low-level manipulators?
>>>>>>>
>>>>>>>                  Another suggestion: renameSeqlevels should not
>>>>>>>
>>>>>> require
>>>
>>>> a
>>>>>
>>>>>>                  named vector (in
>>>>>>>                  cases
>>>>>>>                  where it is unnamed, require that it is parallel to
>>>>>>>
>>>>>> the
>>>
>>>>                  existing
>>>>>>>                  seqlevels, as
>>>>>>>                  with seqlevels<-).
>>>>>>>
>>>>>>>
>>>>>>>              I didn't really indicate what drove my desire to see
>>>>>>>              keepSeqlevels
>>>>>>>              deprecated. keepSeqlevels, seqlevels<-, and isActiveSeq
>>>>>>>
>>>>>> were
>>>>>
>>>>>>              developed more
>>>>>>>              or less independently, and have different contracts. The
>>>>>>>              different
>>>>>>>              contracts are confusing to the user, as is creating a
>>>>>>>
>>>>>> usable
>>>>>
>>>>>>              help system
>>>>>>>              when there are multiple end points for what is a common
>>>>>>>              operation. The help
>>>>>>>              pages of each were inadequate in different ways. And
>>>>>>>
>>>>>> because
>>>>>
>>>>>>              the code was
>>>>>>>              developed independently, support for different objects
>>>>>>>
>>>>>> was
>>>
>>>>              inconsistent. So
>>>>>>>              actually, yes, the maintenance (and use) burden for the
>>>>>>>              previous state of
>>>>>>>              affairs was substantial.
>>>>>>>
>>>>>>>              On the other hand, I agree that keepSeqlevels is
>>>>>>>
>>>>>> convenient
>>>
>>>>              as a simple
>>>>>>>              wrapper around seqlevels<-, in the same way that
>>>>>>> setNames
>>>>>>>              and names<- are
>>>>>>>              both useful.
>>>>>>>
>>>>>>>              So we could iterate to
>>>>>>>
>>>>>>>                  keepSeqlevels <-
>>>>>>>                      function(x, value, ...)
>>>>>>>                  {
>>>>>>>                      seqlevels(x, force=TRUE) <- value
>>>>>>>                      x
>>>>>>>                  }
>>>>>>>
>>>>>>>              but I'd be less enthusiastic about maintaining the
>>>>>>>
>>>>>> original
>>>
>>>>              contract of
>>>>>>>              keepSeqlevels, where keepSeqlevels(gr1, gr2), would have
>>>>>>>              worked for two
>>>>>>>              GRanges objects.
>>>>>>>
>>>>>>>
>>>>>>>      Why would this be called keepSeqlevels() if, depending on how
>>>>>>>
>>>>>> it's
>>>
>>>> used,
>>>>>>>      it will either add, drop, rename, or permute the seqlevels?
>>>>>>>      Couldn't this be called setSeqlevels?
>>>>>>>
>>>>>>>
>>>>>>> I thought that new2old was necessary for permuting.
>>>>>>>
>>>>>>>
>>>>>> The seqlevels setter has no 'new2old' arg.
>>>>>>
>>>>>>
>>>>>>   You're right that
>>>>>>
>>>>>>> adding should be disallowed for keepSeqlevels(). Adding seqlevels is
>>>>>>>
>>>>>> not
>>>>>
>>>>>> a common operation. The two common operations are:
>>>>>>> * Permuting, usually because the data were imported in non-canonical
>>>>>>> order (seqnameStyle was designed to address this, no?).
>>>>>>>
>>>>>>>
>>>>>> Permuting is straightforward with seqlevels<-:
>>>>>>
>>>>>>    > seqlevels(gr)
>>>>>>    [1] "chr1" "chr2" "chr3"
>>>>>>
>>>>>>    > seqlevels(gr) <- rev(seqlevels(gr))
>>>>>>
>>>>>>    > seqlevels(gr)
>>>>>>    [1] "chr3" "chr2" "chr1"
>>>>>>
>>>>>>
>>>>>>   * Subsetting, either by keeping or dropping.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>> Also straightforward:
>>>>>>
>>>>>>    > seqlevels(gr, force=TRUE) <- "chr2"
>>>>>>
>>>>>>    > seqlevels(gr)
>>>>>>    [1] "chr2"
>>>>>>
>>>>>>
>>>>>>   Two main reasons: taking a
>>>>>>
>>>>>>> small slice of the data for prototyping, or removing problematic
>>>>>>> chromosomes (sex, circular). This goes back to bsapply and the
>>>>>>>
>>>>>> exclude
>>>
>>>> argument.
>>>>>>>
>>>>>>>
>>>>>> There are other important use cases:
>>>>>>
>>>>>>    - Dropping seqlevels that are *not* in use. This happens for
>>>>>> example
>>>>>>      when subsetting a BAM file with Rsamtools::filterBam to keep only
>>>>>>      the alignments located on a given chromosome. The entire sequence
>>>>>>      dictionary (in the header) is propagated to the resulting BAM
>>>>>> file
>>>>>>      so when you read the file with readGAlignments() you end up with
>>>>>> a
>>>>>>      bunch of useless seqlevels that you generally start by dropping.
>>>>>>      You don't want to drop any alignment so force=FALSE is your
>>>>>>
>>>>> friend.
>>>
>>>>
>>>>>>
>>>>>>  This is sort of tangential to the discussion, but do you really want
>>>>> to
>>>>>
>>>> do
>>>
>>>> this? I would preserve the universe as given by the BAM.
>>>>>
>>>>>
>>>>>     - An even more common operation is renaming: 90% of the times
>>>>>> that's
>>>>>>      what I use seqlevels<- for, and that's what I tell people to use
>>>>>>      when they need to rename their chromosomes. Also straightforward:
>>>>>>
>>>>>>        > seqlevels(gr)
>>>>>>        [1] "chr1" "chr2" "chr3" "chrM"
>>>>>>
>>>>>>        > seqlevels(gr) <- sub("^chr", "", seqlevels(gr))
>>>>>>        > seqlevels(gr)
>>>>>>        [1] "1" "2" "3" "M"
>>>>>>
>>>>>>        > seqlevels(gr)[seqlevels(gr) == "M"] <- "MT"
>>>>>>        > seqlevels(gr)
>>>>>>        [1] "1"  "2"  "3"  "MT"
>>>>>>
>>>>>>      Note that this is simpler to use than renameSeqlevels() which
>>>>>>      always required you to build the entire right value as a named
>>>>>>      vector.
>>>>>>
>>>>>>
>>>>>>  Sure, renaming is a common use case. Not sure how I forgot that.
>>>>>
>>>>> As I wrote earlier in the thread, renameSeqlevels should be changed so
>>>>>
>>>> as
>>>
>>>> not to require naming of the vector.
>>>>>
>>>>>
>>>>>  So the added-value of keepSeqlevels() seems to boil down to:
>>>>>>
>>>>>>    (a) it always uses force=TRUE
>>>>>>
>>>>>>    (b) it preserves the original object and returns the modified
>>>>>> object
>>>>>>
>>>>>> If you want to restrict the use of keepSeqlevels() to permuting and
>>>>>> dropping, you'll also have to disallow renaming (in addition to
>>>>>>
>>>>> adding).
>>>
>>>> Then its name will more or less reflect what it does (if "keep" means
>>>>>> "permute" or "drop"). The final result will be something that does
>>>>>>
>>>>> less
>>>
>>>> than setSeqlevels() and that is not necessarily easier to read: both
>>>>>> will set on the object whatever vector of seqlevels they are supplied,
>>>>>> except that one will fail when doing so would actually result in
>>>>>>
>>>>> adding
>>>
>>>> or renaming seqlevels.
>>>>>>
>>>>>>
>>>>>>  So it looks like seqlevels<- is pretty powerful. Now I see that if I
>>>>> scroll
>>>>> down to the examples in the man page, I find the actual documentation.
>>>>> It's
>>>>> cool to learn that we can replace seqlevels on TxDb objects. My
>>>>> argument
>>>>> has always been that since these low-level operations are so powerful,
>>>>> it's
>>>>> nice to have high-level operations to clarify the code. We have to
>>>>> think
>>>>> hard to know what the RHS will do in that replacement. It's probably
>>>>> one
>>>>> of
>>>>> the most complex replacement operations; far more complex than the
>>>>>
>>>> typical
>>>
>>>> one, including levels<- on factor. There's nothing wrong with that; it's
>>>>> just complexity that we should be able to hide.
>>>>>
>>>>> Michael
>>>>>
>>>>> H.
>>>>>
>>>>>>
>>>>>>
>>>>>>  Michael
>>>>>>>
>>>>>>>
>>>>>>>      H.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>          Well, I wasn't even aware of that feature, so you've made
>>>>>>>
>>>>>> your
>>>
>>>>          point about
>>>>>>>          the documentation ;) Sounds like a good compromise to me.
>>>>>>>
>>>>>>>          Thanks for understanding,
>>>>>>>          Michael
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>              Martin
>>>>>>>
>>>>>>>
>>>>>>>                  Michael
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>                  On Sat, May 18, 2013 at 6:00 PM, Martin Morgan
>>>>>>>                  <mtmor...@fhcrc.org <mailto:mtmor...@fhcrc.org>
>>>>>>>                  <mailto:mtmor...@fhcrc.org <mailto:
>>>>>>>
>>>>>> mtmor...@fhcrc.org
>>>
>>>>
>>>>>>>>
>>>>>>>                  wrote:
>>>>>>>
>>>>>>>                        On 05/18/2013 05:42 PM, Martin Morgan wrote:
>>>>>>>
>>>>>>>                            Some of the most common operations are
>>>>>>>                  straight-forward, e.g.,
>>>>>>>
>>>>>>>                                  > gr = GRanges(paste0("chr", c(1:22,
>>>>>>>                  "X", "Y")), IRanges(1,
>>>>>>>                  100))
>>>>>>>                                  > seqlevels(gr) = sub("chr", "ch",
>>>>>>>                  seqlevels(gr))
>>>>>>>
>>>>>>>
>>>>>>>                        To get a more comprehensive example I should
>>>>>>>
>>>>>> have
>>>
>>>>                  followed my own
>>>>>>>                  advice and
>>>>>>>                        grabbed from the help page!
>>>>>>>
>>>>>>>                              ## Rename:
>>>>>>>                              seqlevels(txdb) <- sub("chr", "",
>>>>>>>                  seqlevels(txdb))
>>>>>>>                              seqlevels(txdb)
>>>>>>>
>>>>>>>                              seqlevels(txdb) <- paste0("CH",
>>>>>>> seqlevels(txdb))
>>>>>>>                              seqlevels(txdb)
>>>>>>>
>>>>>>>                              seqlevels(txdb)[seqlevels(__****
>>>>>>> **__txdb)
>>>>>>>
>>>>>> ==
>>>
>>>>
>>>>>
>>>>>>>                  "CHM"] <- "M"
>>>>>>>
>>>>>>>
>>>>>>>                              seqlevels(txdb)
>>>>>>>
>>>>>>>
>>>>>>>                        --
>>>>>>>                        Computational Biology / Fred Hutchinson Cancer
>>>>>>>                  Research Center
>>>>>>>                        1100 Fairview Ave. N.
>>>>>>>                        PO Box 19024 Seattle, WA 98109
>>>>>>>
>>>>>>>                        Location: Arnold Building M1 B861
>>>>>>>                        Phone: (206) 667-2793
>>>>>>>
>>>>>> <tel:%28206%29%20667-2793>
>>>
>>>>                  <tel:%28206%29%20667-2793>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>              --
>>>>>>>              Dr. Martin Morgan, PhD
>>>>>>>
>>>>>>>              Fred Hutchinson Cancer Research Center
>>>>>>>              1100 Fairview Ave. N.
>>>>>>>              PO Box 19024 Seattle, WA 98109
>>>>>>>
>>>>>>>
>>>>>>>                   [[alternative HTML version deleted]]
>>>>>>>
>>>>>>>          ______________________________****___________________
>>>>>>>          Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.***
>>>>>>> *org<
>>>>>>>
>>>>>> Bioc-devel@r-project.org>
>>>>>
>>>>>>
>>>>>>>>           mailing list
>>>>>>>          
>>>>>>> https://stat.ethz.ch/mailman/_****_listinfo/bioc-devel<https://stat.ethz.ch/mailman/_**_listinfo/bioc-devel>
>>>>>>> <
>>>>>>>
>>>>>> https://stat.ethz.ch/mailman/_**_listinfo/bioc-devel<https://stat.ethz.ch/mailman/__listinfo/bioc-devel>
>>>>> >
>>>>>
>>>>>>
>>>>>>>          
>>>>>>> <https://stat.ethz.ch/mailman/****listinfo/bioc-devel<https://stat.ethz.ch/mailman/**listinfo/bioc-devel>
>>>>>>> <
>>>>>>>
>>>>>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>>>> >
>>>>>
>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>      --
>>>>>>>      Hervé Pagès
>>>>>>>
>>>>>>>      Program in Computational Biology
>>>>>>>      Division of Public Health Sciences
>>>>>>>
>>>>>>>      Fred Hutchinson Cancer Research Center
>>>>>>>      1100 Fairview Ave. N, M1-B514
>>>>>>>      P.O. Box 19024
>>>>>>>      Seattle, WA 98109-1024
>>>>>>>
>>>>>>>      E-mail: hpa...@fhcrc.org <mailto:hpa...@fhcrc.org>
>>>>>>>      Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>>>>>>>      Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>  --
>>>>>> Hervé Pagès
>>>>>>
>>>>>> Program in Computational Biology
>>>>>> Division of Public Health Sciences
>>>>>> Fred Hutchinson Cancer Research Center
>>>>>> 1100 Fairview Ave. N, M1-B514
>>>>>> P.O. Box 19024
>>>>>> Seattle, WA 98109-1024
>>>>>>
>>>>>> E-mail: hpa...@fhcrc.org
>>>>>> Phone:  (206) 667-5791
>>>>>> Fax:    (206) 667-1319
>>>>>>
>>>>>>
>>>>>          [[alternative HTML version deleted]]
>>>>>
>>>>>
>>>>> ______________________________**_________________
>>>>> Bioc-devel@r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>>>>
>>>>>
>>>>>
>>>>
>>>          [[alternative HTML version deleted]]
>>>
>>>
>>> ______________________________**_________________
>>> Bioc-devel@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>>
>>>
>>>
>>         [[alternative HTML version deleted]]
>>
>>
>>
>> ______________________________**_________________
>> Bioc-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>>
>>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpa...@fhcrc.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] deprecation of keepSeqlevels and renameSeqlevels

Reply via email to