On Thu, Jun 4, 2015 at 11:48 PM, Hervé Pagès <hpa...@fredhutch.org> wrote: > I also think that we're heading towards something like that. > > So genome(gr) <- "hg19" would: > > (a) Add any missing information to the seqinfo. > (b) Sort the seqlevels in "canonical" order. > (c) Change the seqlevels style to UCSC style if they are not. > > The 3 tasks are orthogonal. I guess most of the times people want > an easy way to perform them all at once. > > We could easily support (a) and (b). This assumes that the current > seqlevels are already valid hg19 seqlevels: > > si1 <- Seqinfo(c("chrX", "chrUn_gl000249", "chr2", "chr6_cox_hap2")) > gr1 <- GRanges(seqinfo=si1) > hg19_si <- Seqinfo(genome="hg19") > > ## (a): > seqinfo(gr1) <- merge(seqinfo(gr1), hg19_si)[seqlevels(gr1)] > seqinfo(gr1) > # Seqinfo object with 4 sequences (1 circular) from hg19 genome: > # seqnames seqlengths isCircular genome > # chrX 155270560 FALSE hg19 > # chrUn_gl000249 38502 FALSE hg19 > # chr2 243199373 FALSE hg19 > # chr6_cox_hap2 4795371 FALSE hg19 > > ## (b): > seqlevels(gr1) <- intersect(seqlevels(hg19_si), seqlevels(gr1)) > seqinfo(gr1) > # Seqinfo object with 4 sequences (1 circular) from hg19 genome: > # seqnames seqlengths isCircular genome > # chr2 243199373 FALSE hg19 > # chrX 155270560 FALSE hg19 > # chr6_cox_hap2 4795371 FALSE hg19 > # chrUn_gl000249 38502 FALSE hg19 > > (c) is harder because seqlevelsStyle() doesn't know how to rename > scaffolds yet: > > si2 <- Seqinfo(c("X", "HSCHRUN_RANDOM_CTG42", "2", "HSCHR6_MHC_COX_CTG1")) > gr2 <- GRanges(seqinfo=si2) > > seqlevelsStyle(gr2) > # [1] "NCBI" > > seqlevelsStyle(gr2) <- "UCSC" > seqlevels(gr2) > # [1] "chrX" "HSCHRUN_RANDOM_CTG42" "chr2" > # [4] "HSCHR6_MHC_COX_CTG1" > > So we need to work on this. > > I'm not sure about using genome(gr) <- "hg19" for this. Right now > it sets the genome column of the seqinfo with the supplied string > and nothing else. Aren't there valid use cases for this?
Not sure. People would almost always want the seqname style and order to be consistent with the given genome. > What about > using seqinfo(gr) <- "hg19" instead? It kind of suggests that the > whole seqinfo component actually gets filled. > Yea, but "genome" is so intuitive compared to "seqinfo". > H. > > On 06/04/2015 06:30 PM, Tim Triche, Jr. wrote: >> >> that's kind of always been my goal... >> >> >> Statistics is the grammar of science. >> Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science> >> >> On Thu, Jun 4, 2015 at 6:29 PM, Michael Lawrence >> <lawrence.mich...@gene.com <mailto:lawrence.mich...@gene.com>> wrote: >> >> Maybe this could eventually support setting the seqinfo with: >> >> genome(gr) <- "hg19" >> >> Or is that being too clever? >> >> On Thu, Jun 4, 2015 at 4:28 PM, Hervé Pagès <hpa...@fredhutch.org >> <mailto:hpa...@fredhutch.org>> wrote: >> > Hi, >> > >> > FWIW I started to work on supporting quick generation of a >> standalone >> > Seqinfo object via Seqinfo(genome="hg38") in GenomeInfoDb. >> > >> > It already supports hg38, hg19, hg18, panTro4, panTro3, panTro2, >> > bosTau8, bosTau7, bosTau6, canFam3, canFam2, canFam1, musFur1, >> mm10, >> > mm9, mm8, susScr3, susScr2, rn6, rheMac3, rheMac2, galGal4, >> galGal3, >> > gasAcu1, danRer7, apiMel2, dm6, dm3, ce10, ce6, ce4, ce2, sacCer3, >> > and sacCer2. I'll add more. >> > >> > See ?Seqinfo for some examples. >> > >> > Right now it fetches the information from internet every time you >> > call it but maybe we should just store that information in the >> > GenomeInfoDb package as Tim suggested? >> > >> > H. >> > >> > >> > On 06/03/2015 12:54 PM, Tim Triche, Jr. wrote: >> >> >> >> That would be perfect actually. And it would radically reduce & >> >> modularize maintenance. Maybe that's the best way to go after >> all. Quite >> >> sensible. >> >> >> >> --t >> >> >> >>> On Jun 3, 2015, at 12:46 PM, Vincent Carey >> <st...@channing.harvard.edu <mailto:st...@channing.harvard.edu>> >> >>> wrote: >> >>> >> >>> It really isn't hard to have multiple OrganismDb packages in >> place -- the >> >>> process of making new ones is documented and was given as an >> exercise in >> >>> the EdX course. I don't know if we want to institutionalize it >> and >> >>> distribute such -- I think we might, so that there would be >> Hs19, Hs38, >> >>> mm9, etc. packages. They have very little content, they just >> coordinate >> >>> interactions with packages that you'll already have. >> >>> >> >>> On Wed, Jun 3, 2015 at 3:26 PM, Tim Triche, Jr. >> <tim.tri...@gmail.com <mailto:tim.tri...@gmail.com>> >> >> >>> wrote: >> >>> >> >>>> Right, I typically do that too, and if you're working on human >> data it >> >>>> isn't a big deal. What makes things a lot more of a drag is >> when you >> >>>> work >> >>>> on e.g. mouse data (mm9 vs mm10, aka GRCm37 vs GRCm38) where >> >>>> Mus.musculus >> >>>> is essentially a "build ahead" of Homo.sapiens. >> >>>> >> >>>> R> seqinfo(Homo.sapiens) >> >>>> Seqinfo object with 93 sequences (1 circular) from hg19 genome >> >>>> >> >>>> R> seqinfo(Mus.musculus) >> >>>> Seqinfo object with 66 sequences (1 circular) from mm10 genome: >> >>>> >> >>>> It's not as explicit as directly assigning the seqinfo from a >> genome >> >>>> that >> >>>> corresponds to that of your annotations/results/whatever. I >> know we >> >>>> could >> >>>> all use crossmap or liftOver or whatever, but that's not >> really the >> >>>> same, >> >>>> and it takes time, whereas assigning the proper seqinfo for >> >>>> relationships >> >>>> is very fast. >> >>>> >> >>>> That's all I was getting at... >> >>>> >> >>>> >> >>>> Statistics is the grammar of science. >> >>>> Karl Pearson >> <http://en.wikipedia.org/wiki/The_Grammar_of_Science> >> >>>> >> >>>> On Wed, Jun 3, 2015 at 12:17 PM, Vincent Carey >> >>>> <st...@channing.harvard.edu <mailto:st...@channing.harvard.edu> >> >>>>> >> >>>>> wrote: >> >>>> >> >>>> >> >>>>> I typically get this info from Homo.sapiens. The result is >> parasitic >> >>>>> on >> >>>>> the TxDb that is in there. I don't know how easy it is to swap >> >>>>> alternate >> >>>>> TxDb in to get a different build. I think it would make sense >> to >> >>>>> regard >> >>>>> the OrganismDb instances as foundational for this sort of >> structural >> >>>>> data. >> >>>>> >> >>>>> On Wed, Jun 3, 2015 at 3:12 PM, Kasper Daniel Hansen < >> >>>>> kasperdanielhan...@gmail.com >> <mailto:kasperdanielhan...@gmail.com>> wrote: >> >>>>> >> >>>>>> Let me rephrase this slightly. From one POV the purpose of >> >>>>>> GenomeInfoDb >> >>>>>> is >> >>>>>> clean up the seqinfo slot. Currently it does most of the >> cleaning, >> >>>>>> but >> >>>>>> it >> >>>>>> does not add seqlengths. >> >>>>>> >> >>>>>> It is clear that seqlengths depends on the version of the >> genome, but >> >>>>>> I >> >>>>>> will argue so does the seqnames. Of course, for human, >> chr22 will not >> >>>>>> change. But what about the names of all the random >> contigs? Or for >> >>>>>> other >> >>>>>> organisms, what about going from a draft genome with 10k >> contigs to a >> >>>>>> more >> >>>>>> completely genome assembled into fewer, larger chromosomes. >> >>>>>> >> >>>>>> I acknowledge that this information is present in the BSgenome >> >>>>>> packages, >> >>>>>> but it seems (to me) to be very appropriate to have them >> around for >> >>>>>> cleaning up the seqinfo slot. For some situations it is not >> great to >> >>>>>> depend on 1 GB> download for something that is a few bytes. >> >>>>>> >> >>>>>> Best, >> >>>>>> Kasper >> >>>>>> >> >>>>>> On Wed, Jun 3, 2015 at 3:00 PM, Tim Triche, Jr. >> <tim.tri...@gmail.com <mailto:tim.tri...@gmail.com>> >> >> >>>>>> wrote: >> >>>>>> >> >>>>>>> It would be nice (for a number of reasons) to have >> chromosome lengths >> >>>>>>> readily available in a foundational package like >> GenomeInfoDb, so >> >>>>>>> that, >> >>>>>>> say, >> >>>>>>> >> >>>>>>> data(seqinfo.hg19) >> >>>>>>> seqinfo(myResults) <- seqinfo.hg19[ seqlevels(myResults) ] >> >>>>>>> >> >>>>>>> would work without issues. Is there any particular reason >> this >> >>>>>> >> >>>>>> couldn't >> >>>>>>> >> >>>>>>> happen for the supported/available BSgenomes? It would >> seem like a >> >>>>>> >> >>>>>> simple >> >>>>>>> >> >>>>>>> matter to do >> >>>>>>> >> >>>>>>> R> library(BSgenome.Hsapiens.UCSC.hg19) >> >>>>>>> R> seqinfo.hg19 <- seqinfo(Hsapiens) >> >>>>>>> R> save(seqinfo.hg19, >> >>>>>>> file="~/bioc-devel/GenomeInfoDb/data/seqinfo.hg19.rda") >> >>>>>>> >> >>>>>>> and be done with it until (say) the next release or next >> released >> >>>>>>> BSgenome. I considered looping through the following >> BSgenomes >> >>>>>> >> >>>>>> myself... >> >>>>>>> >> >>>>>>> and if it isn't strongly opposed by (everyone) I may still >> do exactly >> >>>>>>> that. Seems useful, no? >> >>>>>>> >> >>>>>>> e.g. for the following 42 builds, >> >>>>>>> >> >>>>>>> grep("(UCSC|NCBI)", unique(gsub(".masked", "", >> available.genomes())), >> >>>>>>> value=TRUE) >> >>>>>>> [1] "BSgenome.Amellifera.UCSC.apiMel2" >> >>>>>> >> >>>>>> "BSgenome.Btaurus.UCSC.bosTau3" >> >>>>>>> >> >>>>>>> >> >>>>>>> [3] "BSgenome.Btaurus.UCSC.bosTau4" >> >>>>>> >> >>>>>> "BSgenome.Btaurus.UCSC.bosTau6" >> >>>>>>> >> >>>>>>> >> >>>>>>> [5] "BSgenome.Btaurus.UCSC.bosTau8" >> >>>>>>> "BSgenome.Celegans.UCSC.ce10" >> >>>>>>> >> >>>>>>> [7] "BSgenome.Celegans.UCSC.ce2" >> "BSgenome.Celegans.UCSC.ce6" >> >>>>>>> >> >>>>>>> [9] "BSgenome.Cfamiliaris.UCSC.canFam2" >> >>>>>>> "BSgenome.Cfamiliaris.UCSC.canFam3" >> >>>>>>> [11] "BSgenome.Dmelanogaster.UCSC.dm2" >> >>>>>>> "BSgenome.Dmelanogaster.UCSC.dm3" >> >>>>>>> [13] "BSgenome.Dmelanogaster.UCSC.dm6" >> >>>>>> >> >>>>>> "BSgenome.Drerio.UCSC.danRer5" >> >>>>>>> >> >>>>>>> >> >>>>>>> [15] "BSgenome.Drerio.UCSC.danRer6" >> >>>>>> >> >>>>>> "BSgenome.Drerio.UCSC.danRer7" >> >>>>>>> >> >>>>>>> >> >>>>>>> [17] "BSgenome.Ecoli.NCBI.20080805" >> >>>>>>> "BSgenome.Gaculeatus.UCSC.gasAcu1" >> >>>>>>> [19] "BSgenome.Ggallus.UCSC.galGal3" >> >>>>>> >> >>>>>> "BSgenome.Ggallus.UCSC.galGal4" >> >>>>>>> >> >>>>>>> >> >>>>>>> [21] "BSgenome.Hsapiens.NCBI.GRCh38" >> >>>>>>> "BSgenome.Hsapiens.UCSC.hg17" >> >>>>>>> >> >>>>>>> [23] "BSgenome.Hsapiens.UCSC.hg18" >> >>>>>>> "BSgenome.Hsapiens.UCSC.hg19" >> >>>>>>> >> >>>>>>> [25] "BSgenome.Hsapiens.UCSC.hg38" >> >>>>>>> "BSgenome.Mfascicularis.NCBI.5.0" >> >>>>>>> [27] "BSgenome.Mfuro.UCSC.musFur1" >> >>>>>> >> >>>>>> "BSgenome.Mmulatta.UCSC.rheMac2" >> >>>>>>> >> >>>>>>> >> >>>>>>> [29] "BSgenome.Mmulatta.UCSC.rheMac3" >> >>>>>> >> >>>>>> "BSgenome.Mmusculus.UCSC.mm10" >> >>>>>>> >> >>>>>>> >> >>>>>>> [31] "BSgenome.Mmusculus.UCSC.mm8" >> >>>>>>> "BSgenome.Mmusculus.UCSC.mm9" >> >>>>>>> >> >>>>>>> [33] "BSgenome.Ptroglodytes.UCSC.panTro2" >> >>>>>>> "BSgenome.Ptroglodytes.UCSC.panTro3" >> >>>>>>> [35] "BSgenome.Rnorvegicus.UCSC.rn4" >> >>>>>> >> >>>>>> "BSgenome.Rnorvegicus.UCSC.rn5" >> >>>>>>> >> >>>>>>> >> >>>>>>> [37] "BSgenome.Rnorvegicus.UCSC.rn6" >> >>>>>>> "BSgenome.Scerevisiae.UCSC.sacCer1" >> >>>>>>> [39] "BSgenome.Scerevisiae.UCSC.sacCer2" >> >>>>>>> "BSgenome.Scerevisiae.UCSC.sacCer3" >> >>>>>>> [41] "BSgenome.Sscrofa.UCSC.susScr3" >> >>>>>> >> >>>>>> "BSgenome.Tguttata.UCSC.taeGut1" >> >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> Am I insane for suggesting this? It would make things a >> little >> >>>>>>> easier >> >>>>>> >> >>>>>> for >> >>>>>>> >> >>>>>>> rtracklayer, most SummarizedExperiment and SE-derived >> objects, blah, >> >>>>>> >> >>>>>> blah, >> >>>>>>> >> >>>>>>> blah... >> >>>>>>> >> >>>>>>> >> >>>>>>> Best, >> >>>>>>> >> >>>>>>> --t >> >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> Statistics is the grammar of science. >> >>>>>>> Karl Pearson >> <http://en.wikipedia.org/wiki/The_Grammar_of_Science> >> >>>>>> >> >>>>>> >> >>>>>> [[alternative HTML version deleted]] >> >>>>>> >> >>>>>> _______________________________________________ >> >>>>>> Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> >> mailing list >> >>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel >> >>> >> >>> >> >>> [[alternative HTML version deleted]] >> >>> >> >>> _______________________________________________ >> >>> Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> >> mailing list >> >>> https://stat.ethz.ch/mailman/listinfo/bioc-devel >> >> >> >> >> >> _______________________________________________ >> >> Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> >> mailing list >> >> https://stat.ethz.ch/mailman/listinfo/bioc-devel >> >> >> > >> > -- >> > Hervé Pagès >> > >> > Program in Computational Biology >> > Division of Public Health Sciences >> > Fred Hutchinson Cancer Research Center >> > 1100 Fairview Ave. N, M1-B514 >> > P.O. Box 19024 >> > Seattle, WA 98109-1024 >> > >> > E-mail: hpa...@fredhutch.org <mailto:hpa...@fredhutch.org> >> > Phone: (206) 667-5791 <tel:%28206%29%20667-5791> >> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319> >> > >> > >> > _______________________________________________ >> > Bioc-devel@r-project.org <mailto:Bioc-devel@r-project.org> >> mailing list >> > https://stat.ethz.ch/mailman/listinfo/bioc-devel >> >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpa...@fredhutch.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel