Sounds great. Could we have a more intuitive name? Like naturalOrder(). And then the method to fix the GRanges could be restoreNaturalOrder(). Just some suggestions...
On Tue, Sep 17, 2013 at 10:58 AM, Hervé Pagès <hpa...@fhcrc.org> wrote: > For sorting the chromosome names in "natural" order, you can use > the makeSeqnameIds() helper function in GenomicRanges: > > seqlevels(gr) <- seqlevels(gr)[makeSeqnameIds(**seqlevels(gr))] > > It recognizes several naming conventions (chr-prefixed or not, > roman, etc...) e.g.: > > > makeSeqnameIds(c("Y", "1", "9", "M", "2")) > [1] 4 1 3 5 2 > > > makeSeqnameIds(c("chrY", "chrI", "chrIX", "chrM", "chrII")) > [1] 4 1 3 5 2 > > I still need to improve the documentation of this function, put > more examples, and add unit tests. It's used internally by the > makeTranscriptDb* functions in GenomicFeatures to assign internal > ids to the chromosomes so that all the TranscriptDb extractors > can then return GRanges and GRangesList objects with the seqlevels > in the "natural" order. > > Cheers, > H. > > > > On 09/17/2013 10:23 AM, Kasper Daniel Hansen wrote: > >> Actually, what I (and I think several others) just want is >> fixSeqnames() >> which sorts and fixes them to some convention. I don't really care which >> one, but I want to do some easy harmonization, preferably in line with >> other bioc tools. I know this is hard to do for all organisms, but having >> something that deals with the major ones (human, mouse, fruit fly, yeast, >> some monkeys) would be extremely valuable. If this function existed, I >> could just throw all my GRanges at it. >> >> Kasper >> >> >> On Tue, Sep 17, 2013 at 1:14 PM, Michael Lawrence < >> lawrence.mich...@gene.com >> >>> wrote: >>> >> >> When I hit this I usually just do: >>> >>> seqlevels(gr) <- paste0("chr", c(1:22, "X", "Y")) >>> >>> It would be nice if there were a function for sorting chromosome names >>> according to a specified naming convention. Like sortSeqlevels(gr, >>> UCSCNaming()) or the alternative replacement syntax: seqinfo(gr) <- >>> sort(seqinfo(gr), UCSCNaming()). Although sort is only single-dispatch. >>> >>> Michael >>> >>> >>> >>> >>> >>> >>> On Tue, Sep 17, 2013 at 8:59 AM, Vincent Carey >>> <st...@channing.harvard.edu>**wrote: >>> >>> I am replying to this as Michael mentions that this is a powerful >>>> operation. Here's my >>>> problem that I believe is related, but I do not see a straightforward >>>> solution >>>> >>>> bhmm19uni >>>>> >>>> GRanges with 559456 ranges and 4 metadata columns: >>>> seqnames ranges strand | name >>>> score >>>> <Rle> <IRanges> <Rle> | <character> >>>> <numeric> >>>> 1 chr1 [ 67229025, 67341224] * | 13_Heterochrom/lo >>>> 0 >>>> 2 chr1 [203057955, 203060154] * | 12_Repressed >>>> 0 >>>> 3 chr1 [ 8458427, 8486226] * | 13_Heterochrom/lo >>>> 0 >>>> 4 chr1 [ 16904427, 16904826] * | 5_Strong_Enhancer >>>> 0 >>>> 5 chr1 [ 25289427, 25293426] * | 11_Weak_Txn >>>> 0 >>>> ... ... ... ... ... ... >>>> ... >>>> 570766 chr22 [51100931, 51101330] * | 2_Weak_Promoter >>>> 0 >>>> 570767 chr22 [51101331, 51101530] * | 6_Weak_Enhancer >>>> 0 >>>> 570768 chr22 [51101531, 51101730] * | 2_Weak_Promoter >>>> 0 >>>> 570769 chr22 [51101731, 51178130] * | 13_Heterochrom/lo >>>> 0 >>>> 570770 chr22 [51178131, 51178330] * | 15_Repetitive/CNV >>>> 0 >>>> thick numcode >>>> <IRanges> <character> >>>> 1 [ 67001613, 67113812] 13 >>>> 2 [201324578, 201326777] 12 >>>> 3 [ 8381014, 8408813] 13 >>>> 4 [ 16777014, 16777413] 5 >>>> 5 [ 25162014, 25166013] 11 >>>> ... ... ... >>>> 570766 [49447797, 49448196] 2 >>>> 570767 [49448197, 49448396] 6 >>>> 570768 [49448397, 49448596] 2 >>>> 570769 [49448597, 49524996] 13 >>>> 570770 [49524997, 49525196] 15 >>>> --- >>>> seqlengths: >>>> chr1 chr10 chr11 chr5 ... chr2 chr21 >>>> chr4 >>>> 249250621 135534747 135006516 180915260 ... 243199373 48129895 >>>> 191154276 >>>> >>>> If I try to make a karyogram with ggbio, the chromosomes are in an >>>> unnatural order. What is the right approach to getting the seqinfo in a >>>> natural order with respect to chromosome names and lengths? Reassigning >>>> seqlevels seems dangerous but I haven't experimented fully. It must be >>>> a >>>> common use case but I do not see it in doc. >>>> >>>> >>>> >>>> On Tue, May 21, 2013 at 11:00 AM, Michael Lawrence < >>>> lawrence.mich...@gene.com> wrote: >>>> >>>> On Mon, May 20, 2013 at 10:36 PM, Hervé Pagès <hpa...@fhcrc.org> >>>>> wrote: >>>>> >>>>> Michael, >>>>>> >>>>>> >>>>>> On 05/20/2013 08:44 PM, Michael Lawrence wrote: >>>>>> >>>>>> >>>>>>> >>>>>>> >>>>>>> On Mon, May 20, 2013 at 4:13 PM, Hervé Pagès <hpa...@fhcrc.org >>>>>>> <mailto:hpa...@fhcrc.org>> wrote: >>>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> >>>>>>> On 05/20/2013 03:15 PM, Michael Lawrence wrote: >>>>>>> >>>>>>> On Mon, May 20, 2013 at 3:11 PM, Martin Morgan >>>>>>> <mtmor...@fhcrc.org <mailto:mtmor...@fhcrc.org>> wrote: >>>>>>> >>>>>>> Hi Michael -- >>>>>>> >>>>>>> >>>>>>> On 5/20/2013 5:56 AM, Michael Lawrence wrote: >>>>>>> >>>>>>> While it's cool that seqlevels<- has become so >>>>>>> >>>>>> flexible, >>>>> >>>>>> I still claim >>>>>>> that >>>>>>> rename/keep/drop would be a lot easier to read, >>>>>>> >>>>>> because >>> >>>> they describe the >>>>>>> high-level operation, and there's no reason to >>>>>>> >>>>>> deprecate >>>>> >>>>>> them. They're >>>>>>> also >>>>>>> easier to remember. For example, for dropping with >>>>>>> seqlevels<-, the user >>>>>>> needs >>>>>>> to remember that "force" is necessary to drop. Much >>>>>>> easier to just say >>>>>>> "dropSeqlevels(), please". And reimplementing >>>>>>> keepSeqlevels is still too >>>>>>> complicated. Is it such a maintenance burden to have >>>>>>> these simple >>>>>>> wrappers sit >>>>>>> on top of the low-level manipulators? >>>>>>> >>>>>>> Another suggestion: renameSeqlevels should not >>>>>>> >>>>>> require >>> >>>> a >>>>> >>>>>> named vector (in >>>>>>> cases >>>>>>> where it is unnamed, require that it is parallel to >>>>>>> >>>>>> the >>> >>>> existing >>>>>>> seqlevels, as >>>>>>> with seqlevels<-). >>>>>>> >>>>>>> >>>>>>> I didn't really indicate what drove my desire to see >>>>>>> keepSeqlevels >>>>>>> deprecated. keepSeqlevels, seqlevels<-, and isActiveSeq >>>>>>> >>>>>> were >>>>> >>>>>> developed more >>>>>>> or less independently, and have different contracts. The >>>>>>> different >>>>>>> contracts are confusing to the user, as is creating a >>>>>>> >>>>>> usable >>>>> >>>>>> help system >>>>>>> when there are multiple end points for what is a common >>>>>>> operation. The help >>>>>>> pages of each were inadequate in different ways. And >>>>>>> >>>>>> because >>>>> >>>>>> the code was >>>>>>> developed independently, support for different objects >>>>>>> >>>>>> was >>> >>>> inconsistent. So >>>>>>> actually, yes, the maintenance (and use) burden for the >>>>>>> previous state of >>>>>>> affairs was substantial. >>>>>>> >>>>>>> On the other hand, I agree that keepSeqlevels is >>>>>>> >>>>>> convenient >>> >>>> as a simple >>>>>>> wrapper around seqlevels<-, in the same way that >>>>>>> setNames >>>>>>> and names<- are >>>>>>> both useful. >>>>>>> >>>>>>> So we could iterate to >>>>>>> >>>>>>> keepSeqlevels <- >>>>>>> function(x, value, ...) >>>>>>> { >>>>>>> seqlevels(x, force=TRUE) <- value >>>>>>> x >>>>>>> } >>>>>>> >>>>>>> but I'd be less enthusiastic about maintaining the >>>>>>> >>>>>> original >>> >>>> contract of >>>>>>> keepSeqlevels, where keepSeqlevels(gr1, gr2), would have >>>>>>> worked for two >>>>>>> GRanges objects. >>>>>>> >>>>>>> >>>>>>> Why would this be called keepSeqlevels() if, depending on how >>>>>>> >>>>>> it's >>> >>>> used, >>>>>>> it will either add, drop, rename, or permute the seqlevels? >>>>>>> Couldn't this be called setSeqlevels? >>>>>>> >>>>>>> >>>>>>> I thought that new2old was necessary for permuting. >>>>>>> >>>>>>> >>>>>> The seqlevels setter has no 'new2old' arg. >>>>>> >>>>>> >>>>>> You're right that >>>>>> >>>>>>> adding should be disallowed for keepSeqlevels(). Adding seqlevels is >>>>>>> >>>>>> not >>>>> >>>>>> a common operation. The two common operations are: >>>>>>> * Permuting, usually because the data were imported in non-canonical >>>>>>> order (seqnameStyle was designed to address this, no?). >>>>>>> >>>>>>> >>>>>> Permuting is straightforward with seqlevels<-: >>>>>> >>>>>> > seqlevels(gr) >>>>>> [1] "chr1" "chr2" "chr3" >>>>>> >>>>>> > seqlevels(gr) <- rev(seqlevels(gr)) >>>>>> >>>>>> > seqlevels(gr) >>>>>> [1] "chr3" "chr2" "chr1" >>>>>> >>>>>> >>>>>> * Subsetting, either by keeping or dropping. >>>>>> >>>>>>> >>>>>>> >>>>>> Also straightforward: >>>>>> >>>>>> > seqlevels(gr, force=TRUE) <- "chr2" >>>>>> >>>>>> > seqlevels(gr) >>>>>> [1] "chr2" >>>>>> >>>>>> >>>>>> Two main reasons: taking a >>>>>> >>>>>>> small slice of the data for prototyping, or removing problematic >>>>>>> chromosomes (sex, circular). This goes back to bsapply and the >>>>>>> >>>>>> exclude >>> >>>> argument. >>>>>>> >>>>>>> >>>>>> There are other important use cases: >>>>>> >>>>>> - Dropping seqlevels that are *not* in use. This happens for >>>>>> example >>>>>> when subsetting a BAM file with Rsamtools::filterBam to keep only >>>>>> the alignments located on a given chromosome. The entire sequence >>>>>> dictionary (in the header) is propagated to the resulting BAM >>>>>> file >>>>>> so when you read the file with readGAlignments() you end up with >>>>>> a >>>>>> bunch of useless seqlevels that you generally start by dropping. >>>>>> You don't want to drop any alignment so force=FALSE is your >>>>>> >>>>> friend. >>> >>>> >>>>>> >>>>>> This is sort of tangential to the discussion, but do you really want >>>>> to >>>>> >>>> do >>> >>>> this? I would preserve the universe as given by the BAM. >>>>> >>>>> >>>>> - An even more common operation is renaming: 90% of the times >>>>>> that's >>>>>> what I use seqlevels<- for, and that's what I tell people to use >>>>>> when they need to rename their chromosomes. Also straightforward: >>>>>> >>>>>> > seqlevels(gr) >>>>>> [1] "chr1" "chr2" "chr3" "chrM" >>>>>> >>>>>> > seqlevels(gr) <- sub("^chr", "", seqlevels(gr)) >>>>>> > seqlevels(gr) >>>>>> [1] "1" "2" "3" "M" >>>>>> >>>>>> > seqlevels(gr)[seqlevels(gr) == "M"] <- "MT" >>>>>> > seqlevels(gr) >>>>>> [1] "1" "2" "3" "MT" >>>>>> >>>>>> Note that this is simpler to use than renameSeqlevels() which >>>>>> always required you to build the entire right value as a named >>>>>> vector. >>>>>> >>>>>> >>>>>> Sure, renaming is a common use case. Not sure how I forgot that. >>>>> >>>>> As I wrote earlier in the thread, renameSeqlevels should be changed so >>>>> >>>> as >>> >>>> not to require naming of the vector. >>>>> >>>>> >>>>> So the added-value of keepSeqlevels() seems to boil down to: >>>>>> >>>>>> (a) it always uses force=TRUE >>>>>> >>>>>> (b) it preserves the original object and returns the modified >>>>>> object >>>>>> >>>>>> If you want to restrict the use of keepSeqlevels() to permuting and >>>>>> dropping, you'll also have to disallow renaming (in addition to >>>>>> >>>>> adding). >>> >>>> Then its name will more or less reflect what it does (if "keep" means >>>>>> "permute" or "drop"). The final result will be something that does >>>>>> >>>>> less >>> >>>> than setSeqlevels() and that is not necessarily easier to read: both >>>>>> will set on the object whatever vector of seqlevels they are supplied, >>>>>> except that one will fail when doing so would actually result in >>>>>> >>>>> adding >>> >>>> or renaming seqlevels. >>>>>> >>>>>> >>>>>> So it looks like seqlevels<- is pretty powerful. Now I see that if I >>>>> scroll >>>>> down to the examples in the man page, I find the actual documentation. >>>>> It's >>>>> cool to learn that we can replace seqlevels on TxDb objects. My >>>>> argument >>>>> has always been that since these low-level operations are so powerful, >>>>> it's >>>>> nice to have high-level operations to clarify the code. We have to >>>>> think >>>>> hard to know what the RHS will do in that replacement. It's probably >>>>> one >>>>> of >>>>> the most complex replacement operations; far more complex than the >>>>> >>>> typical >>> >>>> one, including levels<- on factor. There's nothing wrong with that; it's >>>>> just complexity that we should be able to hide. >>>>> >>>>> Michael >>>>> >>>>> H. >>>>> >>>>>> >>>>>> >>>>>> Michael >>>>>>> >>>>>>> >>>>>>> H. >>>>>>> >>>>>>> >>>>>>> >>>>>>> Well, I wasn't even aware of that feature, so you've made >>>>>>> >>>>>> your >>> >>>> point about >>>>>>> the documentation ;) Sounds like a good compromise to me. >>>>>>> >>>>>>> Thanks for understanding, >>>>>>> Michael >>>>>>> >>>>>>> >>>>>>> >>>>>>> Martin >>>>>>> >>>>>>> >>>>>>> Michael >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Sat, May 18, 2013 at 6:00 PM, Martin Morgan >>>>>>> <mtmor...@fhcrc.org <mailto:mtmor...@fhcrc.org> >>>>>>> <mailto:mtmor...@fhcrc.org <mailto: >>>>>>> >>>>>> mtmor...@fhcrc.org >>> >>>> >>>>>>>> >>>>>>> wrote: >>>>>>> >>>>>>> On 05/18/2013 05:42 PM, Martin Morgan wrote: >>>>>>> >>>>>>> Some of the most common operations are >>>>>>> straight-forward, e.g., >>>>>>> >>>>>>> > gr = GRanges(paste0("chr", c(1:22, >>>>>>> "X", "Y")), IRanges(1, >>>>>>> 100)) >>>>>>> > seqlevels(gr) = sub("chr", "ch", >>>>>>> seqlevels(gr)) >>>>>>> >>>>>>> >>>>>>> To get a more comprehensive example I should >>>>>>> >>>>>> have >>> >>>> followed my own >>>>>>> advice and >>>>>>> grabbed from the help page! >>>>>>> >>>>>>> ## Rename: >>>>>>> seqlevels(txdb) <- sub("chr", "", >>>>>>> seqlevels(txdb)) >>>>>>> seqlevels(txdb) >>>>>>> >>>>>>> seqlevels(txdb) <- paste0("CH", >>>>>>> seqlevels(txdb)) >>>>>>> seqlevels(txdb) >>>>>>> >>>>>>> seqlevels(txdb)[seqlevels(__**** >>>>>>> **__txdb) >>>>>>> >>>>>> == >>> >>>> >>>>> >>>>>>> "CHM"] <- "M" >>>>>>> >>>>>>> >>>>>>> seqlevels(txdb) >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Computational Biology / Fred Hutchinson Cancer >>>>>>> Research Center >>>>>>> 1100 Fairview Ave. N. >>>>>>> PO Box 19024 Seattle, WA 98109 >>>>>>> >>>>>>> Location: Arnold Building M1 B861 >>>>>>> Phone: (206) 667-2793 >>>>>>> >>>>>> <tel:%28206%29%20667-2793> >>> >>>> <tel:%28206%29%20667-2793> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Dr. Martin Morgan, PhD >>>>>>> >>>>>>> Fred Hutchinson Cancer Research Center >>>>>>> 1100 Fairview Ave. N. >>>>>>> PO Box 19024 Seattle, WA 98109 >>>>>>> >>>>>>> >>>>>>> [[alternative HTML version deleted]] >>>>>>> >>>>>>> ______________________________****___________________ >>>>>>> Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.*** >>>>>>> *org< >>>>>>> >>>>>> Bioc-devel@r-project.org> >>>>> >>>>>> >>>>>>>> mailing list >>>>>>> >>>>>>> https://stat.ethz.ch/mailman/_****_listinfo/bioc-devel<https://stat.ethz.ch/mailman/_**_listinfo/bioc-devel> >>>>>>> < >>>>>>> >>>>>> https://stat.ethz.ch/mailman/_**_listinfo/bioc-devel<https://stat.ethz.ch/mailman/__listinfo/bioc-devel> >>>>> > >>>>> >>>>>> >>>>>>> >>>>>>> <https://stat.ethz.ch/mailman/****listinfo/bioc-devel<https://stat.ethz.ch/mailman/**listinfo/bioc-devel> >>>>>>> < >>>>>>> >>>>>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel> >>>>> > >>>>> >>>>> >>>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Hervé Pagès >>>>>>> >>>>>>> Program in Computational Biology >>>>>>> Division of Public Health Sciences >>>>>>> >>>>>>> Fred Hutchinson Cancer Research Center >>>>>>> 1100 Fairview Ave. N, M1-B514 >>>>>>> P.O. Box 19024 >>>>>>> Seattle, WA 98109-1024 >>>>>>> >>>>>>> E-mail: hpa...@fhcrc.org <mailto:hpa...@fhcrc.org> >>>>>>> Phone: (206) 667-5791 <tel:%28206%29%20667-5791> >>>>>>> Fax: (206) 667-1319 <tel:%28206%29%20667-1319> >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>> Hervé Pagès >>>>>> >>>>>> Program in Computational Biology >>>>>> Division of Public Health Sciences >>>>>> Fred Hutchinson Cancer Research Center >>>>>> 1100 Fairview Ave. N, M1-B514 >>>>>> P.O. Box 19024 >>>>>> Seattle, WA 98109-1024 >>>>>> >>>>>> E-mail: hpa...@fhcrc.org >>>>>> Phone: (206) 667-5791 >>>>>> Fax: (206) 667-1319 >>>>>> >>>>>> >>>>> [[alternative HTML version deleted]] >>>>> >>>>> >>>>> ______________________________**_________________ >>>>> Bioc-devel@r-project.org mailing list >>>>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel> >>>>> >>>>> >>>>> >>>> >>> [[alternative HTML version deleted]] >>> >>> >>> ______________________________**_________________ >>> Bioc-devel@r-project.org mailing list >>> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel> >>> >>> >>> >> [[alternative HTML version deleted]] >> >> >> >> ______________________________**_________________ >> Bioc-devel@r-project.org mailing list >> https://stat.ethz.ch/mailman/**listinfo/bioc-devel<https://stat.ethz.ch/mailman/listinfo/bioc-devel> >> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpa...@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > [[alternative HTML version deleted]]
_______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel