OK I just made sortSeqlevels() a generic with a method for character
vectors and a default method. So you can do sortSeqlevels() directly on
a Seqinfo object, a GRanges object and anything that contains a Seqinfo
component:

  > gr
  GRanges with 0 ranges and 0 metadata columns:
     seqnames    ranges strand
        <Rle> <IRanges>  <Rle>
    ---
    seqlengths:
chr2RHet chr4 chrUextra chrYHet ... chr3R chr2R chrX NA NA NA NA ... NA NA NA

  > sortSeqlevels(gr)
  GRanges with 0 ranges and 0 metadata columns:
     seqnames    ranges strand
        <Rle> <IRanges>  <Rle>
    ---
    seqlengths:
chr2R chr3L chr3R chr4 ... chrXHet chrYHet chrUextra NA NA NA NA ... NA NA NA

H.

On 09/17/2013 01:05 PM, Michael Lawrence wrote:
Thanks, Hervé, that's great.

Would it make sense to have a sort or sortSeqlevels method for Seqinfo
and then a sortSeqlevels,ANY method that just does seqinfo(x) <-
sort(seqinfo(x))?

Michael



On Tue, Sep 17, 2013 at 12:56 PM, Hervé Pagès <hpa...@fhcrc.org
<mailto:hpa...@fhcrc.org>> wrote:

    On 09/17/2013 10:58 AM, Hervé Pagès wrote:

        For sorting the chromosome names in "natural" order, you can use
        the makeSeqnameIds() helper function in GenomicRanges:

            seqlevels(gr) <- seqlevels(gr)[makeSeqnameIds(__seqlevels(gr))]


    I meant:

       seqlevels(gr) <-
    seqlevels(gr)[order(__makeSeqnameIds(seqlevels(gr)))__]

    Probably the source of Michael's confusion... sorry.

    sortSeqlevels() just added to GenomicRanges 1.13.44:

       seqlevels <- c("chr2RHet", "chr4", "chrUextra", "chrYHet",
                       "chrM", "chrXHet", "chr2LHet", "chrU",
                       "chr3L", "chr3R", "chr2R", "chrX")

    Then:

       > sortSeqlevels(seqlevels)
        [1] "chr2R"     "chr3L"     "chr3R"     "chr4"      "chrX" "chrU"
        [7] "chrM"      "chr2LHet"  "chr2RHet"  "chrXHet"   "chrYHet"
    "chrUextra"

       > sortSeqlevels(seqlevels, X.is.sexchrom=TRUE)
        [1] "chr2R"     "chr3L"     "chr3R"     "chr4"      "chrX" "chrU"
        [7] "chrM"      "chr2LHet"  "chr2RHet"  "chrXHet"   "chrYHet"
    "chrUextra"

    H.




        It recognizes several naming conventions (chr-prefixed or not,
        roman, etc...) e.g.:

            > makeSeqnameIds(c("Y", "1", "9", "M", "2"))
            [1] 4 1 3 5 2

            > makeSeqnameIds(c("chrY", "chrI", "chrIX", "chrM", "chrII"))
            [1] 4 1 3 5 2

        I still need to improve the documentation of this function, put
        more examples, and add unit tests. It's used internally by the
        makeTranscriptDb* functions in GenomicFeatures to assign internal
        ids to the chromosomes so that all the TranscriptDb extractors
        can then return GRanges and GRangesList objects with the seqlevels
        in the "natural" order.

        Cheers,
        H.


        On 09/17/2013 10:23 AM, Kasper Daniel Hansen wrote:

            Actually, what I (and I think several others) just want is
                fixSeqnames()
            which sorts and fixes them to some convention.  I don't
            really care which
            one, but I want to do some easy harmonization, preferably in
            line with
            other bioc tools.  I know this is hard to do for all
            organisms, but
            having
            something that deals with the major ones (human, mouse,
            fruit fly, yeast,
            some monkeys) would be extremely valuable.  If this function
            existed, I
            could just throw all my GRanges at it.

            Kasper


            On Tue, Sep 17, 2013 at 1:14 PM, Michael Lawrence
            <lawrence.mich...@gene.com <mailto:lawrence.mich...@gene.com>

                wrote:


                When I hit this I usually just do:

                seqlevels(gr) <- paste0("chr", c(1:22, "X", "Y"))

                It would be nice if there were a function for sorting
                chromosome names
                according to a specified naming convention. Like
                sortSeqlevels(gr,
                UCSCNaming()) or the alternative replacement syntax:
                seqinfo(gr) <-
                sort(seqinfo(gr), UCSCNaming()). Although sort is only
                single-dispatch.

                Michael






                On Tue, Sep 17, 2013 at 8:59 AM, Vincent Carey
                <st...@channing.harvard.edu
                <mailto:st...@channing.harvard.edu>>__wrote:

                    I am replying to this as Michael mentions that this
                    is a powerful
                    operation.  Here's my
                    problem that I believe is related, but I do not see
                    a straightforward
                    solution

                        bhmm19uni

                    GRanges with 559456 ranges and 4 metadata columns:
                               seqnames                 ranges strand
                    |              name
                    score
                                  <Rle>              <IRanges>  <Rle>
                    |       <character>
                    <numeric>
                             1     chr1 [ 67229025,  67341224]      *
                    | 13_Heterochrom/lo
                          0
                             2     chr1 [203057955, 203060154]      *
                    |      12_Repressed
                          0
                             3     chr1 [  8458427,   8486226]      *
                    | 13_Heterochrom/lo
                          0
                             4     chr1 [ 16904427,  16904826]      *
                    | 5_Strong_Enhancer
                          0
                             5     chr1 [ 25289427,  25293426]      *
                    |       11_Weak_Txn
                          0
                           ...      ...                    ...    ...
                    ...               ...
                        ...
                        570766    chr22   [51100931, 51101330]      *
                    |   2_Weak_Promoter
                          0
                        570767    chr22   [51101331, 51101530]      *
                    |   6_Weak_Enhancer
                          0
                        570768    chr22   [51101531, 51101730]      *
                    |   2_Weak_Promoter
                          0
                        570769    chr22   [51101731, 51178130]      *
                    | 13_Heterochrom/lo
                          0
                        570770    chr22   [51178131, 51178330]      *
                    | 15_Repetitive/CNV
                          0
                                                thick     numcode
                                            <IRanges> <character>
                             1 [ 67001613,  67113812]          13
                             2 [201324578, 201326777]          12
                             3 [  8381014,   8408813]          13
                             4 [ 16777014,  16777413]           5
                             5 [ 25162014,  25166013]          11
                           ...                    ...         ...
                        570766   [49447797, 49448196]           2
                        570767   [49448197, 49448396]           6
                        570768   [49448397, 49448596]           2
                        570769   [49448597, 49524996]          13
                        570770   [49524997, 49525196]          15
                        ---
                        seqlengths:
                              chr1     chr10     chr11      chr5 ...
                      chr2     chr21
                       chr4
                         249250621 135534747 135006516 180915260 ...
                    243199373  48129895
                    191154276

                    If I try to make a karyogram with ggbio, the
                    chromosomes are in an
                    unnatural order.  What is the right approach to
                    getting the seqinfo
                    in a
                    natural order with respect to chromosome names and
                    lengths?
                    Reassigning
                    seqlevels seems dangerous but I haven't experimented
                    fully.  It must
                    be a
                    common use case but I do not see it in doc.



                    On Tue, May 21, 2013 at 11:00 AM, Michael Lawrence <
                    lawrence.mich...@gene.com
                    <mailto:lawrence.mich...@gene.com>> wrote:

                        On Mon, May 20, 2013 at 10:36 PM, Hervé Pagès
                        <hpa...@fhcrc.org <mailto:hpa...@fhcrc.org>>
                        wrote:

                            Michael,


                            On 05/20/2013 08:44 PM, Michael Lawrence wrote:




                                On Mon, May 20, 2013 at 4:13 PM, Hervé
                                Pagès <hpa...@fhcrc.org
                                <mailto:hpa...@fhcrc.org>
                                <mailto:hpa...@fhcrc.org
                                <mailto:hpa...@fhcrc.org>>> wrote:

                                      Hi,


                                      On 05/20/2013 03:15 PM, Michael
                                Lawrence wrote:

                                          On Mon, May 20, 2013 at 3:11
                                PM, Martin Morgan
                                          <mtmor...@fhcrc.org
                                <mailto:mtmor...@fhcrc.org>
                                <mailto:mtmor...@fhcrc.org
                                <mailto:mtmor...@fhcrc.org>>> wrote:

                                              Hi Michael --


                                              On 5/20/2013 5:56 AM,
                                Michael Lawrence wrote:

                                                  While it's cool that
                                seqlevels<- has become so

                        flexible,

                                                  I still claim
                                                  that
                                                  rename/keep/drop would
                                be a lot easier to read,

                because

                                                  they describe the
                                                  high-level operation,
                                and there's no reason to

                        deprecate

                                                  them. They're
                                                  also
                                                  easier to remember.
                                For example, for dropping with
                                                  seqlevels<-, the user
                                                  needs
                                                  to remember that
                                "force" is necessary to drop. Much
                                                  easier to just say
                                                  "dropSeqlevels(),
                                please". And reimplementing
                                                  keepSeqlevels is still too
                                                  complicated. Is it
                                such a maintenance burden to
                                have
                                                  these simple
                                                  wrappers sit
                                                  on top of the
                                low-level manipulators?

                                                  Another suggestion:
                                renameSeqlevels should not

                require

                        a

                                                  named vector (in
                                                  cases
                                                  where it is unnamed,
                                require that it is parallel to

                the

                                                  existing
                                                  seqlevels, as
                                                  with seqlevels<-).


                                              I didn't really indicate
                                what drove my desire to see
                                              keepSeqlevels
                                              deprecated. keepSeqlevels,
                                seqlevels<-, and isActiveSeq

                        were

                                              developed more
                                              or less independently, and
                                have different contracts.
                                The
                                              different
                                              contracts are confusing to
                                the user, as is creating a

                        usable

                                              help system
                                              when there are multiple
                                end points for what is a common
                                              operation. The help
                                              pages of each were
                                inadequate in different ways. And

                        because

                                              the code was
                                              developed independently,
                                support for different objects

                was

                                              inconsistent. So
                                              actually, yes, the
                                maintenance (and use) burden for the
                                              previous state of
                                              affairs was substantial.

                                              On the other hand, I agree
                                that keepSeqlevels is

                convenient

                                              as a simple
                                              wrapper around
                                seqlevels<-, in the same way that
                                setNames
                                              and names<- are
                                              both useful.

                                              So we could iterate to

                                                  keepSeqlevels <-
                                                      function(x, value,
                                ...)
                                                  {
                                                      seqlevels(x,
                                force=TRUE) <- value
                                                      x
                                                  }

                                              but I'd be less
                                enthusiastic about maintaining the

                original

                                              contract of
                                              keepSeqlevels, where
                                keepSeqlevels(gr1, gr2), would
                                have
                                              worked for two
                                              GRanges objects.


                                      Why would this be called
                                keepSeqlevels() if, depending on how

                it's

                                used,
                                      it will either add, drop, rename,
                                or permute the seqlevels?
                                      Couldn't this be called setSeqlevels?


                                I thought that new2old was necessary for
                                permuting.


                            The seqlevels setter has no 'new2old' arg.


                               You're right that

                                adding should be disallowed for
                                keepSeqlevels(). Adding seqlevels is

                        not

                                a common operation. The two common
                                operations are:
                                * Permuting, usually because the data
                                were imported in non-canonical
                                order (seqnameStyle was designed to
                                address this, no?).


                            Permuting is straightforward with seqlevels<-:

                                > seqlevels(gr)
                                [1] "chr1" "chr2" "chr3"

                                > seqlevels(gr) <- rev(seqlevels(gr))

                                > seqlevels(gr)
                                [1] "chr3" "chr2" "chr1"


                               * Subsetting, either by keeping or dropping.



                            Also straightforward:

                                > seqlevels(gr, force=TRUE) <- "chr2"

                                > seqlevels(gr)
                                [1] "chr2"


                               Two main reasons: taking a

                                small slice of the data for prototyping,
                                or removing problematic
                                chromosomes (sex, circular). This goes
                                back to bsapply and the

                exclude

                                argument.


                            There are other important use cases:

                                - Dropping seqlevels that are *not* in
                            use. This happens for
                            example
                                  when subsetting a BAM file with
                            Rsamtools::filterBam to keep
                            only
                                  the alignments located on a given
                            chromosome. The entire
                            sequence
                                  dictionary (in the header) is
                            propagated to the resulting BAM
                            file
                                  so when you read the file with
                            readGAlignments() you end up
                            with a
                                  bunch of useless seqlevels that you
                            generally start by dropping.
                                  You don't want to drop any alignment
                            so force=FALSE is your

                friend.



                        This is sort of tangential to the discussion,
                        but do you really
                        want to

                do

                        this? I would preserve the universe as given by
                        the BAM.


                                - An even more common operation is
                            renaming: 90% of the times
                            that's
                                  what I use seqlevels<- for, and that's
                            what I tell people to use
                                  when they need to rename their
                            chromosomes. Also
                            straightforward:

                                    > seqlevels(gr)
                                    [1] "chr1" "chr2" "chr3" "chrM"

                                    > seqlevels(gr) <- sub("^chr", "",
                            seqlevels(gr))
                                    > seqlevels(gr)
                                    [1] "1" "2" "3" "M"

                                    > seqlevels(gr)[seqlevels(gr) ==
                            "M"] <- "MT"
                                    > seqlevels(gr)
                                    [1] "1"  "2"  "3"  "MT"

                                  Note that this is simpler to use than
                            renameSeqlevels() which
                                  always required you to build the
                            entire right value as a named
                                  vector.


                        Sure, renaming is a common use case. Not sure
                        how I forgot that.

                        As I wrote earlier in the thread,
                        renameSeqlevels should be changed so

                as

                        not to require naming of the vector.


                            So the added-value of keepSeqlevels() seems
                            to boil down to:

                                (a) it always uses force=TRUE

                                (b) it preserves the original object and
                            returns the modified
                            object

                            If you want to restrict the use of
                            keepSeqlevels() to permuting and
                            dropping, you'll also have to disallow
                            renaming (in addition to

                adding).

                            Then its name will more or less reflect what
                            it does (if "keep" means
                            "permute" or "drop"). The final result will
                            be something that does

                less

                            than setSeqlevels() and that is not
                            necessarily easier to read: both
                            will set on the object whatever vector of
                            seqlevels they are
                            supplied,
                            except that one will fail when doing so
                            would actually result in

                adding

                            or renaming seqlevels.


                        So it looks like seqlevels<- is pretty powerful.
                        Now I see that if I
                        scroll
                        down to the examples in the man page, I find the
                        actual documentation.
                        It's
                        cool to learn that we can replace seqlevels on
                        TxDb objects. My
                        argument
                        has always been that since these low-level
                        operations are so powerful,
                        it's
                        nice to have high-level operations to clarify
                        the code. We have to
                        think
                        hard to know what the RHS will do in that
                        replacement. It's
                        probably one
                        of
                        the most complex replacement operations; far
                        more complex than the

                typical

                        one, including levels<- on factor. There's
                        nothing wrong with that;
                        it's
                        just complexity that we should be able to hide.

                        Michael

                        H.



                                Michael


                                      H.



                                          Well, I wasn't even aware of
                                that feature, so you've made

                your

                                          point about
                                          the documentation ;) Sounds
                                like a good compromise to me.

                                          Thanks for understanding,
                                          Michael



                                              Martin


                                                  Michael






                                                  On Sat, May 18, 2013
                                at 6:00 PM, Martin Morgan
                                                  <mtmor...@fhcrc.org
                                <mailto:mtmor...@fhcrc.org>
                                <mailto:mtmor...@fhcrc.org
                                <mailto:mtmor...@fhcrc.org>>

                                  <mailto:mtmor...@fhcrc.org
                                <mailto:mtmor...@fhcrc.org> <mailto:

                mtmor...@fhcrc.org <mailto:mtmor...@fhcrc.org>



                                                  wrote:

                                                        On 05/18/2013
                                05:42 PM, Martin Morgan wrote:

                                                            Some of the
                                most common operations are
                                                  straight-forward, e.g.,

                                                                  > gr =
                                GRanges(paste0("chr",
                                c(1:22,
                                                  "X", "Y")), IRanges(1,
                                                  100))
                                                                  >
                                seqlevels(gr) = sub("chr", "ch",
                                                  seqlevels(gr))


                                                        To get a more
                                comprehensive example I should

                have

                                                  followed my own
                                                  advice and
                                                        grabbed from the
                                help page!

                                                              ## Rename:

                                  seqlevels(txdb) <- sub("chr", "",
                                                  seqlevels(txdb))

                                  seqlevels(txdb)


                                  seqlevels(txdb) <- paste0("CH",
                                seqlevels(txdb))

                                  seqlevels(txdb)


                                  seqlevels(txdb)[seqlevels(__**__**__txdb)

                ==



                                                  "CHM"] <- "M"



                                  seqlevels(txdb)


                                                        --
                                                        Computational
                                Biology / Fred Hutchinson
                                Cancer
                                                  Research Center
                                                        1100 Fairview
                                Ave. N.
                                                        PO Box 19024
                                Seattle, WA 98109

                                                        Location: Arnold
                                Building M1 B861
                                                        Phone: (206)
                                667-2793 <tel:%28206%29%20667-2793>

                <tel:%28206%29%20667-2793>

                                                  <tel:%28206%29%20667-2793>




                                              --
                                              Dr. Martin Morgan, PhD

                                              Fred Hutchinson Cancer
                                Research Center
                                              1100 Fairview Ave. N.
                                              PO Box 19024 Seattle, WA 98109


                                                   [[alternative HTML
                                version deleted]]


                                  
________________________________**___________________
                                Bioc-devel@r-project.org
                                <mailto:Bioc-devel@r-project.org>
                                <mailto:Bioc-devel@r-project.
                                <mailto:Bioc-devel@r-project.>*__*org<

                        Bioc-devel@r-project.org
                        <mailto:Bioc-devel@r-project.org>>


                                          mailing list
                                
https://stat.ethz.ch/mailman/___**_listinfo/bioc-devel
                                
<https://stat.ethz.ch/mailman/_**_listinfo/bioc-devel><

                        https://stat.ethz.ch/mailman/____listinfo/bioc-devel
                        <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>



                                  <https://stat.ethz.ch/mailman/__**listinfo/bioc-devel 
<https://stat.ethz.ch/mailman/**listinfo/bioc-devel><

                        https://stat.ethz.ch/mailman/__listinfo/bioc-devel
                        <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>




                                      --
                                      Hervé Pagès

                                      Program in Computational Biology
                                      Division of Public Health Sciences

                                      Fred Hutchinson Cancer Research Center
                                      1100 Fairview Ave. N, M1-B514
                                      P.O. Box 19024
                                      Seattle, WA 98109-1024

                                      E-mail: hpa...@fhcrc.org
                                <mailto:hpa...@fhcrc.org>
                                <mailto:hpa...@fhcrc.org
                                <mailto:hpa...@fhcrc.org>>
                                      Phone: (206) 667-5791
                                <tel:%28206%29%20667-5791>
                                <tel:%28206%29%20667-5791>
                                      Fax: (206) 667-1319
                                <tel:%28206%29%20667-1319>
                                <tel:%28206%29%20667-1319>



                            --
                            Hervé Pagès

                            Program in Computational Biology
                            Division of Public Health Sciences
                            Fred Hutchinson Cancer Research Center
                            1100 Fairview Ave. N, M1-B514
                            P.O. Box 19024
                            Seattle, WA 98109-1024

                            E-mail: hpa...@fhcrc.org
                            <mailto:hpa...@fhcrc.org>
                            Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
                            Fax: (206) 667-1319 <tel:%28206%29%20667-1319>


                                  [[alternative HTML version deleted]]


                        _________________________________________________
                        Bioc-devel@r-project.org
                        <mailto:Bioc-devel@r-project.org> mailing list
                        https://stat.ethz.ch/mailman/__listinfo/bioc-devel
                        <https://stat.ethz.ch/mailman/listinfo/bioc-devel>




                          [[alternative HTML version deleted]]


                _________________________________________________
                Bioc-devel@r-project.org
                <mailto:Bioc-devel@r-project.org> mailing list
                https://stat.ethz.ch/mailman/__listinfo/bioc-devel
                <https://stat.ethz.ch/mailman/listinfo/bioc-devel>



                 [[alternative HTML version deleted]]



            _________________________________________________
            Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org>
            mailing list
            https://stat.ethz.ch/mailman/__listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>



    --
    Hervé Pagès

    Program in Computational Biology
    Division of Public Health Sciences
    Fred Hutchinson Cancer Research Center
    1100 Fairview Ave. N, M1-B514
    P.O. Box 19024
    Seattle, WA 98109-1024

    E-mail: hpa...@fhcrc.org <mailto:hpa...@fhcrc.org>
    Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
    Fax: (206) 667-1319 <tel:%28206%29%20667-1319>



--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Reply via email to