On 09/17/2013 11:30 AM, Michael Lawrence wrote:
Sounds great. Could we have a more intuitive name? Like naturalOrder().

That would be confusing because order() does something different:

  > order(c("chrY", "chr1", "chr9", "chr2"))
  [1] 2 4 3 1

  > makeSeqnameIds(c("chrY", "chr1", "chr9", "chr2"))
  [1] 4 1 3 2

makeSeqnameIds() is more like rank() except on how it deals with
duplicates:

  > makeSeqnameIds(c("chrY", "chr1", "chr9", "chr2", "chr9"))
  [1] 4 1 3 2 3

Whatever 'ties.method' you use, rank() will put "chrY" at rank 5.

IIRC makeSeqnameIds() needs to support duplicates because of how it's
used internally in the makeTranscriptDb* functions.

Alternatively we could add sortSeqlevels() as a wrapper to
makeSeqnameIds() so you would be able to do:

  seqlevels(gr) <- sortSeqlevels(seqlevels(gr))

Does that work?

Thanks,
H.


And then the method to fix the GRanges could be restoreNaturalOrder().
Just some suggestions...




On Tue, Sep 17, 2013 at 10:58 AM, Hervé Pagès <hpa...@fhcrc.org
<mailto:hpa...@fhcrc.org>> wrote:

    For sorting the chromosome names in "natural" order, you can use
    the makeSeqnameIds() helper function in GenomicRanges:

       seqlevels(gr) <- seqlevels(gr)[makeSeqnameIds(__seqlevels(gr))]

    It recognizes several naming conventions (chr-prefixed or not,
    roman, etc...) e.g.:

       > makeSeqnameIds(c("Y", "1", "9", "M", "2"))
       [1] 4 1 3 5 2

       > makeSeqnameIds(c("chrY", "chrI", "chrIX", "chrM", "chrII"))
       [1] 4 1 3 5 2

    I still need to improve the documentation of this function, put
    more examples, and add unit tests. It's used internally by the
    makeTranscriptDb* functions in GenomicFeatures to assign internal
    ids to the chromosomes so that all the TranscriptDb extractors
    can then return GRanges and GRangesList objects with the seqlevels
    in the "natural" order.

    Cheers,
    H.



    On 09/17/2013 10:23 AM, Kasper Daniel Hansen wrote:

        Actually, what I (and I think several others) just want is
            fixSeqnames()
        which sorts and fixes them to some convention.  I don't really
        care which
        one, but I want to do some easy harmonization, preferably in
        line with
        other bioc tools.  I know this is hard to do for all organisms,
        but having
        something that deals with the major ones (human, mouse, fruit
        fly, yeast,
        some monkeys) would be extremely valuable.  If this function
        existed, I
        could just throw all my GRanges at it.

        Kasper


        On Tue, Sep 17, 2013 at 1:14 PM, Michael Lawrence
        <lawrence.mich...@gene.com <mailto:lawrence.mich...@gene.com>

            wrote:


            When I hit this I usually just do:

            seqlevels(gr) <- paste0("chr", c(1:22, "X", "Y"))

            It would be nice if there were a function for sorting
            chromosome names
            according to a specified naming convention. Like
            sortSeqlevels(gr,
            UCSCNaming()) or the alternative replacement syntax:
            seqinfo(gr) <-
            sort(seqinfo(gr), UCSCNaming()). Although sort is only
            single-dispatch.

            Michael






            On Tue, Sep 17, 2013 at 8:59 AM, Vincent Carey
            <st...@channing.harvard.edu
            <mailto:st...@channing.harvard.edu>>__wrote:

                I am replying to this as Michael mentions that this is a
                powerful
                operation.  Here's my
                problem that I believe is related, but I do not see a
                straightforward
                solution

                    bhmm19uni

                GRanges with 559456 ranges and 4 metadata columns:
                           seqnames                 ranges strand   |
                            name
                score
                              <Rle>              <IRanges>  <Rle>   |
                     <character>
                <numeric>
                         1     chr1 [ 67229025,  67341224]      *   |
                13_Heterochrom/lo
                      0
                         2     chr1 [203057955, 203060154]      *   |
                    12_Repressed
                      0
                         3     chr1 [  8458427,   8486226]      *   |
                13_Heterochrom/lo
                      0
                         4     chr1 [ 16904427,  16904826]      *   |
                5_Strong_Enhancer
                      0
                         5     chr1 [ 25289427,  25293426]      *   |
                     11_Weak_Txn
                      0
                       ...      ...                    ...    ... ...
                             ...
                    ...
                    570766    chr22   [51100931, 51101330]      *   |
                2_Weak_Promoter
                      0
                    570767    chr22   [51101331, 51101530]      *   |
                6_Weak_Enhancer
                      0
                    570768    chr22   [51101531, 51101730]      *   |
                2_Weak_Promoter
                      0
                    570769    chr22   [51101731, 51178130]      *   |
                13_Heterochrom/lo
                      0
                    570770    chr22   [51178131, 51178330]      *   |
                15_Repetitive/CNV
                      0
                                            thick     numcode
                                        <IRanges> <character>
                         1 [ 67001613,  67113812]          13
                         2 [201324578, 201326777]          12
                         3 [  8381014,   8408813]          13
                         4 [ 16777014,  16777413]           5
                         5 [ 25162014,  25166013]          11
                       ...                    ...         ...
                    570766   [49447797, 49448196]           2
                    570767   [49448197, 49448396]           6
                    570768   [49448397, 49448596]           2
                    570769   [49448597, 49524996]          13
                    570770   [49524997, 49525196]          15
                    ---
                    seqlengths:
                          chr1     chr10     chr11      chr5 ...
                  chr2     chr21
                   chr4
                     249250621 135534747 135006516 180915260 ...
                243199373  48129895
                191154276

                If I try to make a karyogram with ggbio, the chromosomes
                are in an
                unnatural order.  What is the right approach to getting
                the seqinfo in a
                natural order with respect to chromosome names and
                lengths?  Reassigning
                seqlevels seems dangerous but I haven't experimented
                fully.  It must be a
                common use case but I do not see it in doc.



                On Tue, May 21, 2013 at 11:00 AM, Michael Lawrence <
                lawrence.mich...@gene.com
                <mailto:lawrence.mich...@gene.com>> wrote:

                    On Mon, May 20, 2013 at 10:36 PM, Hervé Pagès
                    <hpa...@fhcrc.org <mailto:hpa...@fhcrc.org>> wrote:

                        Michael,


                        On 05/20/2013 08:44 PM, Michael Lawrence wrote:




                            On Mon, May 20, 2013 at 4:13 PM, Hervé Pagès
                            <hpa...@fhcrc.org <mailto:hpa...@fhcrc.org>
                            <mailto:hpa...@fhcrc.org
                            <mailto:hpa...@fhcrc.org>>> wrote:

                                  Hi,


                                  On 05/20/2013 03:15 PM, Michael
                            Lawrence wrote:

                                      On Mon, May 20, 2013 at 3:11 PM,
                            Martin Morgan
                                      <mtmor...@fhcrc.org
                            <mailto:mtmor...@fhcrc.org>
                            <mailto:mtmor...@fhcrc.org
                            <mailto:mtmor...@fhcrc.org>>> wrote:

                                          Hi Michael --


                                          On 5/20/2013 5:56 AM, Michael
                            Lawrence wrote:

                                              While it's cool that
                            seqlevels<- has become so

                    flexible,

                                              I still claim
                                              that
                                              rename/keep/drop would be
                            a lot easier to read,

            because

                                              they describe the
                                              high-level operation, and
                            there's no reason to

                    deprecate

                                              them. They're
                                              also
                                              easier to remember. For
                            example, for dropping with
                                              seqlevels<-, the user
                                              needs
                                              to remember that "force"
                            is necessary to drop. Much
                                              easier to just say
                                              "dropSeqlevels(), please".
                            And reimplementing
                                              keepSeqlevels is still too
                                              complicated. Is it such a
                            maintenance burden to have
                                              these simple
                                              wrappers sit
                                              on top of the low-level
                            manipulators?

                                              Another suggestion:
                            renameSeqlevels should not

            require

                    a

                                              named vector (in
                                              cases
                                              where it is unnamed,
                            require that it is parallel to

            the

                                              existing
                                              seqlevels, as
                                              with seqlevels<-).


                                          I didn't really indicate what
                            drove my desire to see
                                          keepSeqlevels
                                          deprecated. keepSeqlevels,
                            seqlevels<-, and isActiveSeq

                    were

                                          developed more
                                          or less independently, and
                            have different contracts. The
                                          different
                                          contracts are confusing to the
                            user, as is creating a

                    usable

                                          help system
                                          when there are multiple end
                            points for what is a common
                                          operation. The help
                                          pages of each were inadequate
                            in different ways. And

                    because

                                          the code was
                                          developed independently,
                            support for different objects

            was

                                          inconsistent. So
                                          actually, yes, the maintenance
                            (and use) burden for the
                                          previous state of
                                          affairs was substantial.

                                          On the other hand, I agree
                            that keepSeqlevels is

            convenient

                                          as a simple
                                          wrapper around seqlevels<-, in
                            the same way that setNames
                                          and names<- are
                                          both useful.

                                          So we could iterate to

                                              keepSeqlevels <-
                                                  function(x, value, ...)
                                              {
                                                  seqlevels(x,
                            force=TRUE) <- value
                                                  x
                                              }

                                          but I'd be less enthusiastic
                            about maintaining the

            original

                                          contract of
                                          keepSeqlevels, where
                            keepSeqlevels(gr1, gr2), would have
                                          worked for two
                                          GRanges objects.


                                  Why would this be called
                            keepSeqlevels() if, depending on how

            it's

                            used,
                                  it will either add, drop, rename, or
                            permute the seqlevels?
                                  Couldn't this be called setSeqlevels?


                            I thought that new2old was necessary for
                            permuting.


                        The seqlevels setter has no 'new2old' arg.


                           You're right that

                            adding should be disallowed for
                            keepSeqlevels(). Adding seqlevels is

                    not

                            a common operation. The two common
                            operations are:
                            * Permuting, usually because the data were
                            imported in non-canonical
                            order (seqnameStyle was designed to address
                            this, no?).


                        Permuting is straightforward with seqlevels<-:

                            > seqlevels(gr)
                            [1] "chr1" "chr2" "chr3"

                            > seqlevels(gr) <- rev(seqlevels(gr))

                            > seqlevels(gr)
                            [1] "chr3" "chr2" "chr1"


                           * Subsetting, either by keeping or dropping.



                        Also straightforward:

                            > seqlevels(gr, force=TRUE) <- "chr2"

                            > seqlevels(gr)
                            [1] "chr2"


                           Two main reasons: taking a

                            small slice of the data for prototyping, or
                            removing problematic
                            chromosomes (sex, circular). This goes back
                            to bsapply and the

            exclude

                            argument.


                        There are other important use cases:

                            - Dropping seqlevels that are *not* in use.
                        This happens for example
                              when subsetting a BAM file with
                        Rsamtools::filterBam to keep only
                              the alignments located on a given
                        chromosome. The entire sequence
                              dictionary (in the header) is propagated
                        to the resulting BAM file
                              so when you read the file with
                        readGAlignments() you end up with a
                              bunch of useless seqlevels that you
                        generally start by dropping.
                              You don't want to drop any alignment so
                        force=FALSE is your

            friend.



                    This is sort of tangential to the discussion, but do
                    you really want to

            do

                    this? I would preserve the universe as given by the BAM.


                            - An even more common operation is renaming:
                        90% of the times that's
                              what I use seqlevels<- for, and that's
                        what I tell people to use
                              when they need to rename their
                        chromosomes. Also straightforward:

                                > seqlevels(gr)
                                [1] "chr1" "chr2" "chr3" "chrM"

                                > seqlevels(gr) <- sub("^chr", "",
                        seqlevels(gr))
                                > seqlevels(gr)
                                [1] "1" "2" "3" "M"

                                > seqlevels(gr)[seqlevels(gr) == "M"] <-
                        "MT"
                                > seqlevels(gr)
                                [1] "1"  "2"  "3"  "MT"

                              Note that this is simpler to use than
                        renameSeqlevels() which
                              always required you to build the entire
                        right value as a named
                              vector.


                    Sure, renaming is a common use case. Not sure how I
                    forgot that.

                    As I wrote earlier in the thread, renameSeqlevels
                    should be changed so

            as

                    not to require naming of the vector.


                        So the added-value of keepSeqlevels() seems to
                        boil down to:

                            (a) it always uses force=TRUE

                            (b) it preserves the original object and
                        returns the modified object

                        If you want to restrict the use of
                        keepSeqlevels() to permuting and
                        dropping, you'll also have to disallow renaming
                        (in addition to

            adding).

                        Then its name will more or less reflect what it
                        does (if "keep" means
                        "permute" or "drop"). The final result will be
                        something that does

            less

                        than setSeqlevels() and that is not necessarily
                        easier to read: both
                        will set on the object whatever vector of
                        seqlevels they are supplied,
                        except that one will fail when doing so would
                        actually result in

            adding

                        or renaming seqlevels.


                    So it looks like seqlevels<- is pretty powerful. Now
                    I see that if I
                    scroll
                    down to the examples in the man page, I find the
                    actual documentation.
                    It's
                    cool to learn that we can replace seqlevels on TxDb
                    objects. My argument
                    has always been that since these low-level
                    operations are so powerful,
                    it's
                    nice to have high-level operations to clarify the
                    code. We have to think
                    hard to know what the RHS will do in that
                    replacement. It's probably one
                    of
                    the most complex replacement operations; far more
                    complex than the

            typical

                    one, including levels<- on factor. There's nothing
                    wrong with that; it's
                    just complexity that we should be able to hide.

                    Michael

                    H.



                            Michael


                                  H.



                                      Well, I wasn't even aware of that
                            feature, so you've made

            your

                                      point about
                                      the documentation ;) Sounds like a
                            good compromise to me.

                                      Thanks for understanding,
                                      Michael



                                          Martin


                                              Michael






                                              On Sat, May 18, 2013 at
                            6:00 PM, Martin Morgan
                                              <mtmor...@fhcrc.org
                            <mailto:mtmor...@fhcrc.org>
                            <mailto:mtmor...@fhcrc.org
                            <mailto:mtmor...@fhcrc.org>>
                                              <mailto:mtmor...@fhcrc.org
                            <mailto:mtmor...@fhcrc.org> <mailto:

            mtmor...@fhcrc.org <mailto:mtmor...@fhcrc.org>



                                              wrote:

                                                    On 05/18/2013 05:42
                            PM, Martin Morgan wrote:

                                                        Some of the most
                            common operations are
                                              straight-forward, e.g.,

                                                              > gr =
                            GRanges(paste0("chr", c(1:22,
                                              "X", "Y")), IRanges(1,
                                              100))
                                                              >
                            seqlevels(gr) = sub("chr", "ch",
                                              seqlevels(gr))


                                                    To get a more
                            comprehensive example I should

            have

                                              followed my own
                                              advice and
                                                    grabbed from the
                            help page!

                                                          ## Rename:

                              seqlevels(txdb) <- sub("chr", "",
                                              seqlevels(txdb))
                                                          seqlevels(txdb)


                              seqlevels(txdb) <- paste0("CH",
                            seqlevels(txdb))
                                                          seqlevels(txdb)


                              seqlevels(txdb)[seqlevels(__**__**__txdb)

            ==



                                              "CHM"] <- "M"


                                                          seqlevels(txdb)


                                                    --
                                                    Computational
                            Biology / Fred Hutchinson Cancer
                                              Research Center
                                                    1100 Fairview Ave. N.
                                                    PO Box 19024
                            Seattle, WA 98109

                                                    Location: Arnold
                            Building M1 B861
                                                    Phone: (206)
                            667-2793 <tel:%28206%29%20667-2793>

            <tel:%28206%29%20667-2793>

                                              <tel:%28206%29%20667-2793>




                                          --
                                          Dr. Martin Morgan, PhD

                                          Fred Hutchinson Cancer
                            Research Center
                                          1100 Fairview Ave. N.
                                          PO Box 19024 Seattle, WA 98109


                                               [[alternative HTML
                            version deleted]]


                              
________________________________**___________________
                            Bioc-devel@r-project.org
                            <mailto:Bioc-devel@r-project.org>
                            <mailto:Bioc-devel@r-project.
                            <mailto:Bioc-devel@r-project.>*__*org<

                    Bioc-devel@r-project.org
                    <mailto:Bioc-devel@r-project.org>>


                                      mailing list
                            
https://stat.ethz.ch/mailman/___**_listinfo/bioc-devel
                            
<https://stat.ethz.ch/mailman/_**_listinfo/bioc-devel><

                    https://stat.ethz.ch/mailman/____listinfo/bioc-devel
                    <https://stat.ethz.ch/mailman/__listinfo/bioc-devel>>



                              <https://stat.ethz.ch/mailman/__**listinfo/bioc-devel 
<https://stat.ethz.ch/mailman/**listinfo/bioc-devel><

                    https://stat.ethz.ch/mailman/__listinfo/bioc-devel
                    <https://stat.ethz.ch/mailman/listinfo/bioc-devel>>




                                  --
                                  Hervé Pagès

                                  Program in Computational Biology
                                  Division of Public Health Sciences

                                  Fred Hutchinson Cancer Research Center
                                  1100 Fairview Ave. N, M1-B514
                                  P.O. Box 19024
                                  Seattle, WA 98109-1024

                                  E-mail: hpa...@fhcrc.org
                            <mailto:hpa...@fhcrc.org>
                            <mailto:hpa...@fhcrc.org
                            <mailto:hpa...@fhcrc.org>>
                                  Phone: (206) 667-5791
                            <tel:%28206%29%20667-5791>
                            <tel:%28206%29%20667-5791>
                                  Fax: (206) 667-1319
                            <tel:%28206%29%20667-1319>
                            <tel:%28206%29%20667-1319>



                        --
                        Hervé Pagès

                        Program in Computational Biology
                        Division of Public Health Sciences
                        Fred Hutchinson Cancer Research Center
                        1100 Fairview Ave. N, M1-B514
                        P.O. Box 19024
                        Seattle, WA 98109-1024

                        E-mail: hpa...@fhcrc.org <mailto:hpa...@fhcrc.org>
                        Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
                        Fax: (206) 667-1319 <tel:%28206%29%20667-1319>


                              [[alternative HTML version deleted]]


                    _________________________________________________
                    Bioc-devel@r-project.org
                    <mailto:Bioc-devel@r-project.org> mailing list
                    https://stat.ethz.ch/mailman/__listinfo/bioc-devel
                    <https://stat.ethz.ch/mailman/listinfo/bioc-devel>




                      [[alternative HTML version deleted]]


            _________________________________________________
            Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org>
            mailing list
            https://stat.ethz.ch/mailman/__listinfo/bioc-devel
            <https://stat.ethz.ch/mailman/listinfo/bioc-devel>



                 [[alternative HTML version deleted]]



        _________________________________________________
        Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org>
        mailing list
        https://stat.ethz.ch/mailman/__listinfo/bioc-devel
        <https://stat.ethz.ch/mailman/listinfo/bioc-devel>


    --
    Hervé Pagès

    Program in Computational Biology
    Division of Public Health Sciences
    Fred Hutchinson Cancer Research Center
    1100 Fairview Ave. N, M1-B514
    P.O. Box 19024
    Seattle, WA 98109-1024

    E-mail: hpa...@fhcrc.org <mailto:hpa...@fhcrc.org>
    Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
    Fax: (206) 667-1319 <tel:%28206%29%20667-1319>



--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Reply via email to