For sorting the chromosome names in "natural" order, you can use
the makeSeqnameIds() helper function in GenomicRanges:

  seqlevels(gr) <- seqlevels(gr)[makeSeqnameIds(seqlevels(gr))]

It recognizes several naming conventions (chr-prefixed or not,
roman, etc...) e.g.:

  > makeSeqnameIds(c("Y", "1", "9", "M", "2"))
  [1] 4 1 3 5 2

  > makeSeqnameIds(c("chrY", "chrI", "chrIX", "chrM", "chrII"))
  [1] 4 1 3 5 2

I still need to improve the documentation of this function, put
more examples, and add unit tests. It's used internally by the
makeTranscriptDb* functions in GenomicFeatures to assign internal
ids to the chromosomes so that all the TranscriptDb extractors
can then return GRanges and GRangesList objects with the seqlevels
in the "natural" order.

Cheers,
H.


On 09/17/2013 10:23 AM, Kasper Daniel Hansen wrote:
Actually, what I (and I think several others) just want is
   fixSeqnames()
which sorts and fixes them to some convention.  I don't really care which
one, but I want to do some easy harmonization, preferably in line with
other bioc tools.  I know this is hard to do for all organisms, but having
something that deals with the major ones (human, mouse, fruit fly, yeast,
some monkeys) would be extremely valuable.  If this function existed, I
could just throw all my GRanges at it.

Kasper


On Tue, Sep 17, 2013 at 1:14 PM, Michael Lawrence <lawrence.mich...@gene.com
wrote:

When I hit this I usually just do:

seqlevels(gr) <- paste0("chr", c(1:22, "X", "Y"))

It would be nice if there were a function for sorting chromosome names
according to a specified naming convention. Like sortSeqlevels(gr,
UCSCNaming()) or the alternative replacement syntax: seqinfo(gr) <-
sort(seqinfo(gr), UCSCNaming()). Although sort is only single-dispatch.

Michael






On Tue, Sep 17, 2013 at 8:59 AM, Vincent Carey
<st...@channing.harvard.edu>wrote:

I am replying to this as Michael mentions that this is a powerful
operation.  Here's my
problem that I believe is related, but I do not see a straightforward
solution

bhmm19uni
GRanges with 559456 ranges and 4 metadata columns:
          seqnames                 ranges strand   |              name
score
             <Rle>              <IRanges>  <Rle>   |       <character>
<numeric>
        1     chr1 [ 67229025,  67341224]      *   | 13_Heterochrom/lo
     0
        2     chr1 [203057955, 203060154]      *   |      12_Repressed
     0
        3     chr1 [  8458427,   8486226]      *   | 13_Heterochrom/lo
     0
        4     chr1 [ 16904427,  16904826]      *   | 5_Strong_Enhancer
     0
        5     chr1 [ 25289427,  25293426]      *   |       11_Weak_Txn
     0
      ...      ...                    ...    ... ...               ...
   ...
   570766    chr22   [51100931, 51101330]      *   |   2_Weak_Promoter
     0
   570767    chr22   [51101331, 51101530]      *   |   6_Weak_Enhancer
     0
   570768    chr22   [51101531, 51101730]      *   |   2_Weak_Promoter
     0
   570769    chr22   [51101731, 51178130]      *   | 13_Heterochrom/lo
     0
   570770    chr22   [51178131, 51178330]      *   | 15_Repetitive/CNV
     0
                           thick     numcode
                       <IRanges> <character>
        1 [ 67001613,  67113812]          13
        2 [201324578, 201326777]          12
        3 [  8381014,   8408813]          13
        4 [ 16777014,  16777413]           5
        5 [ 25162014,  25166013]          11
      ...                    ...         ...
   570766   [49447797, 49448196]           2
   570767   [49448197, 49448396]           6
   570768   [49448397, 49448596]           2
   570769   [49448597, 49524996]          13
   570770   [49524997, 49525196]          15
   ---
   seqlengths:
         chr1     chr10     chr11      chr5 ...      chr2     chr21
  chr4
    249250621 135534747 135006516 180915260 ... 243199373  48129895
191154276

If I try to make a karyogram with ggbio, the chromosomes are in an
unnatural order.  What is the right approach to getting the seqinfo in a
natural order with respect to chromosome names and lengths?  Reassigning
seqlevels seems dangerous but I haven't experimented fully.  It must be a
common use case but I do not see it in doc.



On Tue, May 21, 2013 at 11:00 AM, Michael Lawrence <
lawrence.mich...@gene.com> wrote:

On Mon, May 20, 2013 at 10:36 PM, Hervé Pagès <hpa...@fhcrc.org> wrote:

Michael,


On 05/20/2013 08:44 PM, Michael Lawrence wrote:




On Mon, May 20, 2013 at 4:13 PM, Hervé Pagès <hpa...@fhcrc.org
<mailto:hpa...@fhcrc.org>> wrote:

     Hi,


     On 05/20/2013 03:15 PM, Michael Lawrence wrote:

         On Mon, May 20, 2013 at 3:11 PM, Martin Morgan
         <mtmor...@fhcrc.org <mailto:mtmor...@fhcrc.org>> wrote:

             Hi Michael --


             On 5/20/2013 5:56 AM, Michael Lawrence wrote:

                 While it's cool that seqlevels<- has become so
flexible,
                 I still claim
                 that
                 rename/keep/drop would be a lot easier to read,
because
                 they describe the
                 high-level operation, and there's no reason to
deprecate
                 them. They're
                 also
                 easier to remember. For example, for dropping with
                 seqlevels<-, the user
                 needs
                 to remember that "force" is necessary to drop. Much
                 easier to just say
                 "dropSeqlevels(), please". And reimplementing
                 keepSeqlevels is still too
                 complicated. Is it such a maintenance burden to have
                 these simple
                 wrappers sit
                 on top of the low-level manipulators?

                 Another suggestion: renameSeqlevels should not
require
a
                 named vector (in
                 cases
                 where it is unnamed, require that it is parallel to
the
                 existing
                 seqlevels, as
                 with seqlevels<-).


             I didn't really indicate what drove my desire to see
             keepSeqlevels
             deprecated. keepSeqlevels, seqlevels<-, and isActiveSeq
were
             developed more
             or less independently, and have different contracts. The
             different
             contracts are confusing to the user, as is creating a
usable
             help system
             when there are multiple end points for what is a common
             operation. The help
             pages of each were inadequate in different ways. And
because
             the code was
             developed independently, support for different objects
was
             inconsistent. So
             actually, yes, the maintenance (and use) burden for the
             previous state of
             affairs was substantial.

             On the other hand, I agree that keepSeqlevels is
convenient
             as a simple
             wrapper around seqlevels<-, in the same way that setNames
             and names<- are
             both useful.

             So we could iterate to

                 keepSeqlevels <-
                     function(x, value, ...)
                 {
                     seqlevels(x, force=TRUE) <- value
                     x
                 }

             but I'd be less enthusiastic about maintaining the
original
             contract of
             keepSeqlevels, where keepSeqlevels(gr1, gr2), would have
             worked for two
             GRanges objects.


     Why would this be called keepSeqlevels() if, depending on how
it's
used,
     it will either add, drop, rename, or permute the seqlevels?
     Couldn't this be called setSeqlevels?


I thought that new2old was necessary for permuting.


The seqlevels setter has no 'new2old' arg.


  You're right that
adding should be disallowed for keepSeqlevels(). Adding seqlevels is
not
a common operation. The two common operations are:
* Permuting, usually because the data were imported in non-canonical
order (seqnameStyle was designed to address this, no?).


Permuting is straightforward with seqlevels<-:

   > seqlevels(gr)
   [1] "chr1" "chr2" "chr3"

   > seqlevels(gr) <- rev(seqlevels(gr))

   > seqlevels(gr)
   [1] "chr3" "chr2" "chr1"


  * Subsetting, either by keeping or dropping.


Also straightforward:

   > seqlevels(gr, force=TRUE) <- "chr2"

   > seqlevels(gr)
   [1] "chr2"


  Two main reasons: taking a
small slice of the data for prototyping, or removing problematic
chromosomes (sex, circular). This goes back to bsapply and the
exclude
argument.


There are other important use cases:

   - Dropping seqlevels that are *not* in use. This happens for example
     when subsetting a BAM file with Rsamtools::filterBam to keep only
     the alignments located on a given chromosome. The entire sequence
     dictionary (in the header) is propagated to the resulting BAM file
     so when you read the file with readGAlignments() you end up with a
     bunch of useless seqlevels that you generally start by dropping.
     You don't want to drop any alignment so force=FALSE is your
friend.


This is sort of tangential to the discussion, but do you really want to
do
this? I would preserve the universe as given by the BAM.


   - An even more common operation is renaming: 90% of the times that's
     what I use seqlevels<- for, and that's what I tell people to use
     when they need to rename their chromosomes. Also straightforward:

       > seqlevels(gr)
       [1] "chr1" "chr2" "chr3" "chrM"

       > seqlevels(gr) <- sub("^chr", "", seqlevels(gr))
       > seqlevels(gr)
       [1] "1" "2" "3" "M"

       > seqlevels(gr)[seqlevels(gr) == "M"] <- "MT"
       > seqlevels(gr)
       [1] "1"  "2"  "3"  "MT"

     Note that this is simpler to use than renameSeqlevels() which
     always required you to build the entire right value as a named
     vector.


Sure, renaming is a common use case. Not sure how I forgot that.

As I wrote earlier in the thread, renameSeqlevels should be changed so
as
not to require naming of the vector.


So the added-value of keepSeqlevels() seems to boil down to:

   (a) it always uses force=TRUE

   (b) it preserves the original object and returns the modified object

If you want to restrict the use of keepSeqlevels() to permuting and
dropping, you'll also have to disallow renaming (in addition to
adding).
Then its name will more or less reflect what it does (if "keep" means
"permute" or "drop"). The final result will be something that does
less
than setSeqlevels() and that is not necessarily easier to read: both
will set on the object whatever vector of seqlevels they are supplied,
except that one will fail when doing so would actually result in
adding
or renaming seqlevels.


So it looks like seqlevels<- is pretty powerful. Now I see that if I
scroll
down to the examples in the man page, I find the actual documentation.
It's
cool to learn that we can replace seqlevels on TxDb objects. My argument
has always been that since these low-level operations are so powerful,
it's
nice to have high-level operations to clarify the code. We have to think
hard to know what the RHS will do in that replacement. It's probably one
of
the most complex replacement operations; far more complex than the
typical
one, including levels<- on factor. There's nothing wrong with that; it's
just complexity that we should be able to hide.

Michael

H.


Michael


     H.



         Well, I wasn't even aware of that feature, so you've made
your
         point about
         the documentation ;) Sounds like a good compromise to me.

         Thanks for understanding,
         Michael



             Martin


                 Michael






                 On Sat, May 18, 2013 at 6:00 PM, Martin Morgan
                 <mtmor...@fhcrc.org <mailto:mtmor...@fhcrc.org>
                 <mailto:mtmor...@fhcrc.org <mailto:
mtmor...@fhcrc.org


                 wrote:

                       On 05/18/2013 05:42 PM, Martin Morgan wrote:

                           Some of the most common operations are
                 straight-forward, e.g.,

                                 > gr = GRanges(paste0("chr", c(1:22,
                 "X", "Y")), IRanges(1,
                 100))
                                 > seqlevels(gr) = sub("chr", "ch",
                 seqlevels(gr))


                       To get a more comprehensive example I should
have
                 followed my own
                 advice and
                       grabbed from the help page!

                             ## Rename:
                             seqlevels(txdb) <- sub("chr", "",
                 seqlevels(txdb))
                             seqlevels(txdb)

                             seqlevels(txdb) <- paste0("CH",
seqlevels(txdb))
                             seqlevels(txdb)

                             seqlevels(txdb)[seqlevels(__****__txdb)
==


                 "CHM"] <- "M"


                             seqlevels(txdb)


                       --
                       Computational Biology / Fred Hutchinson Cancer
                 Research Center
                       1100 Fairview Ave. N.
                       PO Box 19024 Seattle, WA 98109

                       Location: Arnold Building M1 B861
                       Phone: (206) 667-2793
<tel:%28206%29%20667-2793>
                 <tel:%28206%29%20667-2793>




             --
             Dr. Martin Morgan, PhD

             Fred Hutchinson Cancer Research Center
             1100 Fairview Ave. N.
             PO Box 19024 Seattle, WA 98109


                  [[alternative HTML version deleted]]

         ______________________________**___________________
         Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.**org<
Bioc-devel@r-project.org>

         mailing list
         https://stat.ethz.ch/mailman/_**_listinfo/bioc-devel<
https://stat.ethz.ch/mailman/__listinfo/bioc-devel>

         <https://stat.ethz.ch/mailman/**listinfo/bioc-devel<
https://stat.ethz.ch/mailman/listinfo/bioc-devel>




     --
     Hervé Pagès

     Program in Computational Biology
     Division of Public Health Sciences

     Fred Hutchinson Cancer Research Center
     1100 Fairview Ave. N, M1-B514
     P.O. Box 19024
     Seattle, WA 98109-1024

     E-mail: hpa...@fhcrc.org <mailto:hpa...@fhcrc.org>
     Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
     Fax: (206) 667-1319 <tel:%28206%29%20667-1319>



--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


         [[alternative HTML version deleted]]


_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel




         [[alternative HTML version deleted]]


_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



        [[alternative HTML version deleted]]



_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Reply via email to