Yes, thanks for those details, Herve. I changed rtracklayer to take the first word as the seqlevels.
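
As a rough illustration of that trimming (not the actual rtracklayer code), here is a minimal base R sketch. The helper name `first_word` is made up for this example, and the input strings are copied from the seqnames output in the original report quoted below.

```r
# Full FASTA header strings, as reported by seqnames(seqinfo(tbf)) in the
# original report quoted further down in this thread.
full_names <- c(
  "1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF",
  "10 dna:chromosome chromosome:GRCh38:10:1:133797422:1 REF",
  "11 dna:chromosome chromosome:GRCh38:11:1:135086622:1 REF"
)

# Hypothetical helper (an assumption for illustration, not an rtracklayer
# function): drop everything from the first whitespace character onwards,
# which leaves only the FASTA sequence identifier.
first_word <- function(x) {
  sub("[[:space:]].*$", "", x)
}

short_names <- first_word(full_names)

# Trimming must not collapse two different sequences onto the same name.
stopifnot(anyDuplicated(short_names) == 0L)

print(short_names)
```

Names that contain no whitespace pass through unchanged, so applying the helper to already-short names such as those in a BSgenome object would be harmless. The duplicate check is there because trimming could in principle map two distinct headers to the same identifier.
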
Michael

On Sun, Jan 10, 2016 at 9:50 AM, Rainer Johannes <johannes.rai...@eurac.edu> wrote:
> That would be great! I don’t think we would lose any information with that
> behaviour. All fasta files from Ensembl have that format (id, whitespace and
> description). Implementing that would render the fasta files from Ensembl
> provided as TwoBit files via AnnotationHub usable without having to tweak the
> objects afterwards. I would highly appreciate that (and people working with
> other species than mouse/human/rat too)!
>
> jo
>
>> On 09 Jan 2016, at 21:19, Hervé Pagès <hpa...@fredhutch.org> wrote:
>>
>> On 01/09/2016 08:42 AM, Michael Lawrence wrote:
>>> I can understand the desire to avoid defining and enforcing our own
>>> standards on third-party data: it's error-prone, potentially
>>> confusing, etc. But the same is even more true of expecting the user
>>> to perform the mapping via some ad hoc approach.
>>>
>>> It's unfortunate that Ensembl does not follow the convention of naming
>>> their FASTA sequences by their seqlevels, but I'm not sure how
>>> widespread that convention is in the first place.
>>
>> Actually, and according to https://en.wikipedia.org/wiki/FASTA_format,
>> it seems that:
>>
>>   The word following the ">" symbol is the identifier of the sequence,
>>   and the rest of the line is the description (both are optional).
>>
>> so Rsamtools::indexFa is doing the right thing by trimming the
>> description line. Maybe that's what seqinfo,TwoBitFile should do
>> too?
>>
>> H.
>>
>>>
>>> Why does Bioconductor distribute genomes in two different ways:
>>> BSgenome and via AnnotationHub? Couldn't those two distribution
>>> mechanisms be unified? That might mitigate some of the maintenance
>>> cost and better encapsulate the added complexity.
>>>
>>> Michael
>>>
>>> On Sat, Jan 9, 2016 at 8:12 AM, Morgan, Martin
>>> <martin.mor...@roswellpark.org> wrote:
>>>> We switched to TwoBitFile with a recent Ensembl release, thinking that it
>>>> had better performance and other characteristics compared to the previous
>>>> FaFile.
>>>>
>>>> The 'recipe' used to create the FaFiles did not explicitly trim the label;
>>>> that appears to be something done by Rsamtools::indexFa and hence a (now
>>>> quite dated) version of samtools.
>>>>
>>>> I'm not precisely sure where we stand on correcting this. The original
>>>> approach just takes what we're given and makes a 2bit file. At least
>>>> provisionally we had decided (after Thurs / Fri exchanges) to make the
>>>> seqlevels sensible on the way in to AnnotationHub; this is against Sean's
>>>> advice and I'm not really a big fan of this.
>>>>
>>>> I like the idea of being able to dynamically remap the seqlevels when the
>>>> 2bit file is loaded by AnnotationHub, which would require Herve's
>>>> suggestion of settable seqlevels on TwoBitFile.
>>>>
>>>> Martin
>>>> ________________________________________
>>>> From: Bioc-devel [bioc-devel-boun...@r-project.org] on behalf of Rainer
>>>> Johannes [johannes.rai...@eurac.edu]
>>>> Sent: Saturday, January 09, 2016 11:01 AM
>>>> To: Hervé Pagès
>>>> Cc: Michael Lawrence; Martin Morgan
>>>> Subject: Re: [Bioc-devel] Problem with seqnames of TwoBitFile from
>>>> AnnotationHub
>>>>
>>>> Yes, using BSgenome would help in this case.
>>>> In the long run I think it might be important to have this fixed, not
>>>> necessarily for human, but for other species/genome builds for which there
>>>> might not be a BSgenome package available; through AnnotationHub all GTF
>>>> files and fasta files would be available. Note also that the FaFiles from
>>>> Ensembl do have the “correct” chromosome names although I assume they were
>>>> built from the same Ensembl fasta files as the TwoBitFiles.
>>>>
>>>> jo
>>>>
>>>>> On 08 Jan 2016, at 22:49, Hervé Pagès <hpa...@fredhutch.org> wrote:
>>>>>
>>>>> On 01/08/2016 01:09 PM, Michael Lawrence wrote:
>>>>>> That is one solution. But everyone using that genome would need to
>>>>>> reset the seqlevels to the "standard" ones. In this specific case, is
>>>>>> there any reason not to just use the BSgenome for GRCh38?
>>>>>
>>>>> I agree. Maybe we don't need seqlevels<-,TwoBitFile for that particular
>>>>> use case. Just wanted to mention that the ability to rename the
>>>>> sequences in a TwoBitFile, FastaFile, or other file-based object that
>>>>> supports seqinfo() would be useful in general.
>>>>>
>>>>> H.
>>>>>
>>>>>>
>>>>>> On Fri, Jan 8, 2016 at 11:04 AM, Hervé Pagès <hpa...@fredhutch.org>
>>>>>> wrote:
>>>>>>> Hi Jo, Michael,
>>>>>>>
>>>>>>> What about implementing a seqlevels() setter for TwoBitFile objects? All
>>>>>>> you need for this is an extra slot for storing the user-supplied
>>>>>>> seqlevels. Note that in general the seqlevels() setter allows more than
>>>>>>> renaming the seqlevels. It also allows dropping, adding, and shuffling
>>>>>>> them. But you don't need to support all that. Supporting renaming would
>>>>>>> already go a long way. See selectMethod("seqlevels<-", "TxDb") in
>>>>>>> GenomicFeatures for an example of a restricted "seqlevels<-" method.
>>>>>>>
>>>>>>> H.
>>>>>>>
>>>>>>>
>>>>>>> On 01/08/2016 09:50 AM, Rainer Johannes wrote:
>>>>>>>>
>>>>>>>> I agree, I would not modify the file content. At present it is however
>>>>>>>> not possible to use e.g. getSeq on these TwoBitFiles, since the
>>>>>>>> chromosome names in the submitted GRanges (e.g. 1) do not match the
>>>>>>>> seqnames/seqinfo of the TwoBitFile. I don’t know if a seqnames or
>>>>>>>> seqinfo method stripping off all but the first name-part would help
>>>>>>>> here...
>>>>>>>>
>>>>>>>> jo
>>>>>>>>
>>>>>>>>> On 08 Jan 2016, at 15:18, Sean Davis <seand...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> I will make the small editorial comment to guard against modifying
>>>>>>>>> file content on transit into the hub object. On the client side
>>>>>>>>> (after getting such an object) I think a “fix” would be to have a
>>>>>>>>> quick seqnames method to strip off all but the first
>>>>>>>>> whitespace-delimited piece.
>>>>>>>>>
>>>>>>>>> Sean
>>>>>>>>>
>>>>>>>>>> On Jan 8, 2016, at 8:40 AM, Michael Lawrence
>>>>>>>>>> <lawrence.mich...@gene.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> This is perhaps something that could be handled when populating the
>>>>>>>>>> hub, but I'm not sure how rtracklayer could automatically derive the
>>>>>>>>>> chromosome names.
>>>>>>>>>>
>>>>>>>>>> On Fri, Jan 8, 2016 at 2:37 AM, Rainer Johannes
>>>>>>>>>> <johannes.rai...@eurac.edu> wrote:
>>>>>>>>>>>
>>>>>>>>>>> dear all,
>>>>>>>>>>>
>>>>>>>>>>> I just ran into a problem with a TwoBitFile I fetched from
>>>>>>>>>>> AnnotationHub. I was fetching a TwoBitFile with the genomic DNA
>>>>>>>>>>> sequence, as provided by Ensembl:
>>>>>>>>>>>
>>>>>>>>>>>> library(AnnotationHub)
>>>>>>>>>>>> ah <- AnnotationHub()
>>>>>>>>>>>> tbf <- ah[["AH50068"]]
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> head(seqnames(seqinfo(tbf)))
>>>>>>>>>>>
>>>>>>>>>>> [1] "1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF"
>>>>>>>>>>> [2] "10 dna:chromosome chromosome:GRCh38:10:1:133797422:1 REF"
>>>>>>>>>>> [3] "11 dna:chromosome chromosome:GRCh38:11:1:135086622:1 REF"
>>>>>>>>>>> [4] "12 dna:chromosome chromosome:GRCh38:12:1:133275309:1 REF"
>>>>>>>>>>> [5] "13 dna:chromosome chromosome:GRCh38:13:1:114364328:1 REF"
>>>>>>>>>>> [6] "14 dna:chromosome chromosome:GRCh38:14:1:107043718:1 REF"
>>>>>>>>>>>
>>>>>>>>>>> Would be nice if the seqnames would be really just the chromosome
>>>>>>>>>>> names and not the whole string from the FA file header.
>>>>>>>>>>> Is there a way I could fix the file myself or is this something
>>>>>>>>>>> that should be fixed in the rtracklayer or AnnotationHub package
>>>>>>>>>>> when the TwoBitFile is created?
>>>>>>>>>>>
>>>>>>>>>>> thanks, jo
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> Bioc-devel@r-project.org mailing list
>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>>
>>>>>>> --
>>>>>>> Hervé Pagès
>>>>>>>
>>>>>>> Program in Computational Biology
>>>>>>> Division of Public Health Sciences
>>>>>>> Fred Hutchinson Cancer Research Center
>>>>>>> 1100 Fairview Ave. N, M1-B514
>>>>>>> P.O. Box 19024
>>>>>>> Seattle, WA 98109-1024
>>>>>>>
>>>>>>> E-mail: hpa...@fredhutch.org
>>>>>>> Phone: (206) 667-5791
>>>>>>> Fax: (206) 667-1319
>>>>
>>>>
>>>> This email message may contain legally privileged and/or confidential
>>>> information.
>>>> If you are not the intended recipient(s), or the employee or
>>>> agent responsible for the delivery of this message to the intended
>>>> recipient(s), you are hereby notified that any disclosure, copying,
>>>> distribution, or use of this email message is prohibited. If you have
>>>> received this message in error, please notify the sender immediately by
>>>> e-mail and delete this email message from your computer. Thank you.