Re: [Bioc-devel] Problem with seqnames of TwoBitFile from AnnotationHub
Yes, thanks for those details, Hervé. I changed rtracklayer to take the
first word as the seqlevels.

Michael

On Sun, Jan 10, 2016 at 9:50 AM, Rainer Johannes wrote:
> That would be great! I don’t think we would lose any information with that
> behaviour. All fasta files from Ensembl have that format (id, whitespace and
> description). Implementing that would render the fasta files from Ensembl
> provided as TwoBit files via AnnotationHub usable without having to tweak the
> objects afterwards. I would highly appreciate that (and people working with
> other species than mouse/human/rat too)!
>
> jo
>
>> On 09 Jan 2016, at 21:19, Hervé Pagès wrote:
>>
>> On 01/09/2016 08:42 AM, Michael Lawrence wrote:
>>> I can understand the desire to avoid defining and enforcing our own
>>> standards on third-party data: it's error-prone, potentially
>>> confusing, etc. But the same is even more true of expecting the user
>>> to perform the mapping via some ad hoc approach.
>>>
>>> It's unfortunate that Ensembl does not follow the convention of naming
>>> their FASTA sequences by their seqlevels, but I'm not sure how
>>> widespread that convention is in the first place.
>>
>> Actually, and according to https://en.wikipedia.org/wiki/FASTA_format,
>> it seems that:
>>
>>   The word following the ">" symbol is the identifier of the sequence,
>>   and the rest of the line is the description (both are optional).
>>
>> so Rsamtools::indexFa is doing the right thing by trimming the
>> description line. Maybe that's what seqinfo,TwoBitFile should do
>> too?
>>
>> H.
>>
>>>
>>> Why does Bioconductor distribute genomes in two different ways:
>>> BSgenome and via AnnotationHub? Couldn't those two distribution
>>> mechanisms be unified? That might mitigate some of the maintenance
>>> cost and better encapsulate the added complexity.
>>>
>>> Michael
>>>
>>> On Sat, Jan 9, 2016 at 8:12 AM, Morgan, Martin wrote:
>>>> We switched to TwoBitFile with a recent Ensembl release, thinking that it
>>>> had better performance and other characteristics compared to the previous
>>>> FaFile.
>>>>
>>>> The 'recipe' used to create the FaFiles did not explicitly trim the label;
>>>> that appears to be something done by Rsamtools::indexFa and hence (a now
>>>> quite dated) version of samtools.
>>>>
>>>> I'm not precisely sure where we stand on correcting this. The original
>>>> approach just takes what we're given and makes a 2bit file. At least
>>>> provisionally we had decided (after Thurs / Fri exchanges) to make the
>>>> seqlevels sensible on the way in to AnnotationHub; this is against Sean's
>>>> advice and I'm not really a big fan of this.
>>>>
>>>> I like the idea of being able to dynamically remap the seqlevels when the
>>>> 2bit file is loaded by AnnotationHub, which would require Hervé's
>>>> suggestion of settable seqlevels on TwoBitFile.
>>>>
>>>> Martin
>>>>
>>>> From: Bioc-devel [bioc-devel-boun...@r-project.org] on behalf of Rainer
>>>> Johannes [johannes.rai...@eurac.edu]
>>>> Sent: Saturday, January 09, 2016 11:01 AM
>>>> To: Hervé Pagès
>>>> Cc: Michael Lawrence; Martin Morgan
>>>> Subject: Re: [Bioc-devel] Problem with seqnames of TwoBitFile from
>>>> AnnotationHub
>>>>
>>>> Yes, using BSgenome would help in this case.
>>>> In the long run I think it might be important to have this fixed, not
>>>> necessarily for human, but for other species/genome builds for which there
>>>> might not be a BSgenome package available; through AnnotationHub all GTF
>>>> files and fasta files would be available. Note also that the FaFiles from
>>>> Ensembl do have the “correct” chromosome names although I assume they were
>>>> built from the same Ensembl fasta files as the TwoBitFiles.
>>>>
>>>> jo
>>>>
>>>>> On 08 Jan 2016, at 22:49, Hervé Pagès wrote:
>>>>>
>>>>> On 01/08/2016 01:09 PM, Michael Lawrence wrote:
>>>>>> That is one solution. But everyone using that genome would need to
>>>>>> reset the seqlevels to the "standard" ones. In this specific case, is
>>>>>> there any reason not to just use the BSgenome for GRCh38?
>>>>>
>>>>> I agree. Maybe we don't need seqlevels<-,TwoBitFile for that particular
>>>>> use case. Just wanted to mention that the ability to rename the
>>>>> sequences in a TwoBitFile, FastaFile, or other file-based object that
>>>>> supports seqinfo() would be useful in general.
>>>>>
>>>>> H.
>>>>>
>>>>>>
>>>>>> On Fri, Jan 8, 2016 at 11:04 AM, Hervé Pagès wrote:
>>>>>>> Hi Jo, Michael,
>>>>>>>
>>>>>>> What about implementing a seqlevels() setter for TwoBitFile objects? All
>>>>>>> you need for this is an extra slot for storing the user-supplied
>>>>>>> seqlevels. Note that in general the seqlevels() setter allows more than
>>>>>>> renaming the seqlevels. It also allows dropping, adding, and shuffling
>>>>>>> them. But you don't need to support all that. Supporting renaming would
>>>>>>> already go a long way. See selectMethod("seqlevels<-", "Tx
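
The "first word" rule discussed in this message can be illustrated with a few lines of base R. This is only a sketch of the idea, not the code that went into rtracklayer: the header strings are assumed Ensembl-style examples (identifier, whitespace, description) and `first_word` is a hypothetical helper name.

```r
# Assumed, illustrative Ensembl-style FASTA header lines (without the leading
# ">"): an identifier, whitespace, then a free-text description.
headers <- c(
  "1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF",
  "MT dna:chromosome chromosome:GRCh38:MT:1:16569:1 REF",
  "X"
)

# Hypothetical helper: drop everything from the first whitespace character on,
# which leaves only the sequence identifier. Headers without whitespace are
# returned unchanged.
first_word <- function(x) sub("\\s.*$", "", x)

seqlevels_guess <- first_word(headers)
print(seqlevels_guess)
```

Names trimmed this way line up with the seqlevels used by Ensembl-based annotation resources, which is what makes the 2bit files usable without renaming them by hand.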
Re: [Bioc-devel] Problem with seqnames of TwoBitFile from AnnotationHub
That would be great! I don’t think we would lose any information with that
behaviour. All fasta files from Ensembl have that format (id, whitespace and
description). Implementing that would render the fasta files from Ensembl
provided as TwoBit files via AnnotationHub usable without having to tweak the
objects afterwards. I would highly appreciate that (and people working with
other species than mouse/human/rat too)!

jo

> On 09 Jan 2016, at 21:19, Hervé Pagès wrote:
>
> On 01/09/2016 08:42 AM, Michael Lawrence wrote:
>> I can understand the desire to avoid defining and enforcing our own
>> standards on third-party data: it's error-prone, potentially
>> confusing, etc. But the same is even more true of expecting the user
>> to perform the mapping via some ad hoc approach.
>>
>> It's unfortunate that Ensembl does not follow the convention of naming
>> their FASTA sequences by their seqlevels, but I'm not sure how
>> widespread that convention is in the first place.
>
> Actually, and according to https://en.wikipedia.org/wiki/FASTA_format,
> it seems that:
>
>   The word following the ">" symbol is the identifier of the sequence,
>   and the rest of the line is the description (both are optional).
>
> so Rsamtools::indexFa is doing the right thing by trimming the
> description line. Maybe that's what seqinfo,TwoBitFile should do
> too?
>
> H.
>
>>
>> Why does Bioconductor distribute genomes in two different ways:
>> BSgenome and via AnnotationHub? Couldn't those two distribution
>> mechanisms be unified? That might mitigate some of the maintenance
>> cost and better encapsulate the added complexity.
>>
>> Michael
>>
>> On Sat, Jan 9, 2016 at 8:12 AM, Morgan, Martin wrote:
>>> We switched to TwoBitFile with a recent Ensembl release, thinking that it
>>> had better performance and other characteristics compared to the previous
>>> FaFile.
>>>
>>> The 'recipe' used to create the FaFiles did not explicitly trim the label;
>>> that appears to be something done by Rsamtools::indexFa and hence (a now
>>> quite dated) version of samtools.
>>>
>>> I'm not precisely sure where we stand on correcting this. The original
>>> approach just takes what we're given and makes a 2bit file. At least
>>> provisionally we had decided (after Thurs / Fri exchanges) to make the
>>> seqlevels sensible on the way in to AnnotationHub; this is against Sean's
>>> advice and I'm not really a big fan of this.
>>>
>>> I like the idea of being able to dynamically remap the seqlevels when the
>>> 2bit file is loaded by AnnotationHub, which would require Hervé's
>>> suggestion of settable seqlevels on TwoBitFile.
>>>
>>> Martin
>>>
>>> From: Bioc-devel [bioc-devel-boun...@r-project.org] on behalf of Rainer
>>> Johannes [johannes.rai...@eurac.edu]
>>> Sent: Saturday, January 09, 2016 11:01 AM
>>> To: Hervé Pagès
>>> Cc: Michael Lawrence; Martin Morgan
>>> Subject: Re: [Bioc-devel] Problem with seqnames of TwoBitFile from
>>> AnnotationHub
>>>
>>> Yes, using BSgenome would help in this case.
>>> In the long run I think it might be important to have this fixed, not
>>> necessarily for human, but for other species/genome builds for which there
>>> might not be a BSgenome package available; through AnnotationHub all GTF
>>> files and fasta files would be available. Note also that the FaFiles from
>>> Ensembl do have the “correct” chromosome names although I assume they were
>>> built from the same Ensembl fasta files as the TwoBitFiles.
>>>
>>> jo
>>>
>>>> On 08 Jan 2016, at 22:49, Hervé Pagès wrote:
>>>>
>>>> On 01/08/2016 01:09 PM, Michael Lawrence wrote:
>>>>> That is one solution. But everyone using that genome would need to
>>>>> reset the seqlevels to the "standard" ones. In this specific case, is
>>>>> there any reason not to just use the BSgenome for GRCh38?
>>>>
>>>> I agree. Maybe we don't need seqlevels<-,TwoBitFile for that particular
>>>> use case. Just wanted to mention that the ability to rename the
>>>> sequences in a TwoBitFile, FastaFile, or other file-based object that
>>>> supports seqinfo() would be useful in general.
>>>>
>>>> H.
>>>>
>>>>>
>>>>> On Fri, Jan 8, 2016 at 11:04 AM, Hervé Pagès wrote:
>>>>>> Hi Jo, Michael,
>>>>>>
>>>>>> What about implementing a seqlevels() setter for TwoBitFile objects? All
>>>>>> you need for this is an extra slot for storing the user-supplied
>>>>>> seqlevels. Note that in general the seqlevels() setter allows more than
>>>>>> renaming the seqlevels. It also allows dropping, adding, and shuffling
>>>>>> them. But you don't need to support all that. Supporting renaming would
>>>>>> already go a long way. See selectMethod("seqlevels<-", "TxDb") in
>>>>>> GenomicFeatures for an example of a restricted "seqlevels<-" method.
>>>>>>
>>>>>> H.
>>>>>>
>>>>>>
>>>>>> On 01/08/2016 09:50 AM, Rainer Johannes wrote:
>>>>>>>
>>>>>>> I agree, I would not modify the file content. At present it is however
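
Hervé's suggestion quoted above (an extra slot holding user-supplied seqlevels, with a setter restricted to renaming) could look roughly like the following S4 sketch. Everything in it is hypothetical and for illustration only: `ToySeqFile`, `toySeqlevels` and `toFileSeqname` are made-up names, not part of rtracklayer or GenomeInfoDb, and the sequence names are assumed Ensembl-style examples.

```r
library(methods)

# Hypothetical toy class, NOT rtracklayer's TwoBitFile: it keeps the names
# found in the file plus an extra slot for the user-supplied seqlevels.
setClass("ToySeqFile",
         representation(fileSeqnames = "character",
                        userSeqlevels = "character"))

# Constructor: the user-facing seqlevels start out equal to the file names.
ToySeqFile <- function(fileSeqnames) {
  new("ToySeqFile", fileSeqnames = fileSeqnames, userSeqlevels = fileSeqnames)
}

# Hypothetical getter and setter, named toySeqlevels so they cannot be
# confused with the real seqlevels() generic.
setGeneric("toySeqlevels", function(x) standardGeneric("toySeqlevels"))
setMethod("toySeqlevels", "ToySeqFile", function(x) x@userSeqlevels)

setGeneric("toySeqlevels<-",
           function(x, value) standardGeneric("toySeqlevels<-"))
setReplaceMethod("toySeqlevels", "ToySeqFile", function(x, value) {
  value <- as.character(value)
  # Renaming only: one new name per sequence, no duplicates, so dropping,
  # adding and merging are rejected.
  if (length(value) != length(x@fileSeqnames) || anyDuplicated(value) > 0L)
    stop("only one-to-one renaming of seqlevels is supported")
  x@userSeqlevels <- value
  x
})

# Translate a user-facing seqlevel back to the name stored in the file,
# which is what a real import method would need before reading sequence.
toFileSeqname <- function(x, seqlevel) {
  x@fileSeqnames[match(seqlevel, x@userSeqlevels)]
}

f <- ToySeqFile(c("1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF",
                  "MT dna:chromosome chromosome:GRCh38:MT:1:16569:1 REF"))
toySeqlevels(f) <- c("1", "MT")
print(toySeqlevels(f))
print(toFileSeqname(f, "MT"))
```

The file on disk is never modified in this design; only the mapping between the stored names and the names shown to the user changes, which matches the wish expressed in the thread not to alter the file content.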
Re: [Bioc-devel] Use of EnsDb in the AnnotationDbi framework
Thanks Johannes, this is a valuable contribution and I've added it to the
'official' vignette.

Martin

From: Bioc-devel [bioc-devel-boun...@r-project.org] on behalf of Rainer
Johannes [johannes.rai...@eurac.edu]
Sent: Thursday, January 07, 2016 9:10 AM
To: Bioc-devel
Subject: [Bioc-devel] Use of EnsDb in the AnnotationDbi framework

Dear all,

thanks to Vince’s suggestion I have now implemented the central AnnotationDbi
methods “columns”, “keys”, “keytypes” and “select” for EnsDb objects in my
ensembldb package version 1.3.11 (EnsDb objects are TxDb-like annotation
packages/objects tailored to Ensembl-based annotations). Thus, EnsDb-based
annotation packages can now be used in the AnnotationDbi framework. The
methods additionally support the filter framework of the ensembldb package to
provide more fine-grained querying and data retrieval.

I have forked the AnnotationDbi package and added a section to its
“IntroToAnnotationPackages.Rnw” vignette describing how to use EnsDb objects
along with the above-mentioned methods
(https://github.com/jotsetung/AnnotationDbi). Eventually this could also be
included in the “official” AnnotationDbi package.

best, jo

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel