Hej Martin! That is indeed extremely useful. I’ll add that to my vignette as a source for annotation. One thing that comes immediately to mind is to have AnnotationHub and GenomicFeatures interact. I could imagine it would be useful to have e.g. a makeTranscriptDbFromAnnotationHub function. I admit I haven’t though about this in detail, but I already can think of use cases where such a function would come in handy.
Nico --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delho...@embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany --------------------------------------------------------------- On 15 Nov 2013, at 20:13, Martin Morgan <mtmor...@fhcrc.org> wrote: > On 11/15/2013 10:22 AM, Michael Lawrence wrote: >> Doesn't look like genomeIntervals has any C code (?), so a performance >> comparison would be interesting. rtracklayer jumps through all sorts of >> hoops to handle obscure things like URL encoding in GFF3. The code in >> genomeIntervals seems more streamlined. > > Wanted to mention, and it would be good to know if this was not helpful at > all, that the Ensembl gtf files are available through AnnotationHub as > GRanges objects > > > library(AnnotationHub) > > hub = AnnotationHub() > > hub$ensembl.release.73.<tab> > hub$ensembl.release.73.fasta. ... [378] > hub$ensembl.release.73.gtf. ... [63] > > xx = > > hub$ensembl.release.73.gtf.gallus_gallus.Gallus_gallus.Galgal4.73.gtf_0.0.1.RData > > xx > GRanges with 381368 ranges and 12 metadata columns: > seqnames ranges strand | source type > <Rle> <IRanges> <Rle> | <factor> <factor> > [1] 1 [1735, 2449] + | protein_coding exon > [2] 1 [2379, 2449] + | protein_coding CDS > score phase gene_id transcript_id > <numeric> <integer> <character> <character> > [1] <NA> <NA> ENSGALG00000009771 ENSGALT00000015891 > [2] <NA> 0 ENSGALG00000009771 ENSGALT00000015891 > exon_number gene_biotype exon_id protein_id > <numeric> <character> <character> <character> > [1] 1 protein_coding ENSGALE00000301221 <NA> > [2] 1 protein_coding <NA> ENSGALP00000015874 > gene_name transcript_name > <character> <character> > [1] <NA> <NA> > [2] <NA> <NA> > [ reached getOption("max.print") -- omitted 9 rows ] > --- > seqlengths: > 1 2 ... AADN03010940.1 > NA NA ... NA > > Martin > >> >> >> >> >> >> >> On Fri, Nov 15, 2013 at 10:14 AM, Nicolas Delhomme <delho...@embl.de> wrote: >> >>> Took that thread to the devel list, just feels more appropriate with >>> regards to the content. >>> >>> I already have that on my TODO list :-). This is not up-to-date, i.e. I >>> haven’t done the comparison in ~2 years, but last time I did, >>> genomeIntervals attribute parsing was faster than rtracklayer equivalent. I >>> suppose that’s because it is already implemented in C in genomeIntervals. >>> As said I don’t have any actual comparative numbers, still you might want >>> to have a look at the genomeIntervals code. As I don’t think that >>> genomeIntervals get as much exposition as rtracklayer does, many more >>> people would benefit from an equivalent rtracklayer implementation. If >>> you’re interested, I could do a performance comparison - based on my usual >>> use case - between both packages. >>> >>> Nico >>> >>> --------------------------------------------------------------- >>> Nicolas Delhomme >>> >>> Genome Biology Computational Support >>> >>> European Molecular Biology Laboratory >>> >>> Tel: +49 6221 387 8310 >>> Email: nicolas.delho...@embl.de >>> Meyerhofstrasse 1 - Postfach 10.2209 >>> 69102 Heidelberg, Germany >>> --------------------------------------------------------------- >>> >>> >>> >>> >>> >>> On 15 Nov 2013, at 18:58, Michael Lawrence <lawrence.mich...@gene.com> >>> wrote: >>> >>>> It might be worth taking a look at rtracklayer and the TranscriptDb >>> stuff in GenomicFeatures. It could save you time, and if you notice any >>> deficiencies in rtracklayer, it would help me. For example, if the >>> attribute parsing is a bottleneck, I can push it down to C. >>>> >>>> Michael >>>> >>>> On Fri, Nov 15, 2013 at 8:23 AM, Nicolas Delhomme <delho...@embl.de> >>> wrote: >>>> Hej Michael, >>>> >>>> Good question really. I have a number of reason for this: >>>> >>>> 1) I’ve been using the genomeIntervals readGff3 function for that - for >>> years now - and I’ve always been satisfied by its performance, especially >>> when parsing the gff/gtf ninth column. The parseGffAttribute and >>> getGffAttribute functions are extremely convenient. I honestly haven’t >>> checked if there was any recent development in rtracklayer / >>> GenomicFeatures similar to these functions. If there were not, I think they >>> would be a great addition to either package. >>>> >>>> 2) As you might guess it’s essentially historical, back when I started >>> that package in 2009, there was not today’s fantastic set of packages. >>>> >>>> 3) As you painfully know, there’s about as many gff format as they are >>> gff files, and because my package is a pipeline I really want to make sure >>> that it’s output is consistent, hence I have strict requirement with >>> regards to the gff/gtf format I accept. Which means that times and again, I >>> have to do slight adjustment but I prefer that over outputting garbage. >>>> >>>> 4) RNA-Seq analyses are filled with pitfalls, hence I think it is >>> essential that users understand the data formats they handle and actually >>> what these analyses are all about. I don’t want them to use my package as >>> they would use a black box. >>>> >>>> 5) It’s educational. There’s a vignette that describes how to parse and >>> convert gff/gtf annotation in the minimal gff/gtf formatted file that would >>> suit my package >>>> >>>> Well, I suppose it’s more than you asked for, but here are my reasons >>> ;-) You’re welcome to comment and I’d be happy to look again at rtracklayer >>> (been through GenomicFeatures recently and I like it much) if you would >>> advise me so. >>>> >>>> Have a nice WE, >>>> >>>> Cheers, >>>> >>>> Nico >>>> >>>> >>>> --------------------------------------------------------------- >>>> Nicolas Delhomme >>>> >>>> Genome Biology Computational Support >>>> >>>> European Molecular Biology Laboratory >>>> >>>> Tel: +49 6221 387 8310 >>>> Email: nicolas.delho...@embl.de >>>> Meyerhofstrasse 1 - Postfach 10.2209 >>>> 69102 Heidelberg, Germany >>>> --------------------------------------------------------------- >>>> >>>> >>>> >>>> >>>> >>>> On 15 Nov 2013, at 12:44, Michael Lawrence <lawrence.mich...@gene.com> >>> wrote: >>>> >>>>> Why not use rtracklayer / GenomicFeatures for parsing GTF? That format >>> is tough; no reason for everyone to take it on by themselves. >>>>> >>>>> >>>>> >>>>> >>>>> On Fri, Nov 15, 2013 at 2:40 AM, Nicolas Delhomme <delho...@embl.de> >>> wrote: >>>>> Hej Natalia! >>>>> >>>>> There were a number of lines in that particular gtf that violated the >>> assumptions I had about EnsEMBL gtf. Not all the fields in the attributes' >>> column were always set and one of the gene name had a space character in >>> it. I’ve made the parsing of gtf file annotation more flexible/lenient and >>> that should resolve that particular issue you had. The changes should >>> propagate in ~2 days to Bioc with easyRNASeq version 1.8.2. >>>>> >>>>> Rather than using the geneModel, which implementation is old and has >>> gotten slow because of changes in the underlying architecture, I prefer an >>> approach where I >>>>> 1) filter the gtf / gff annotation file for only those lines I’m >>> interested in (e.g. of type exon, mRNA and gene for a gff file) >>>>> 2) collapse every exon of a gene into what I call now a “synthetic >>> transcript”. The reason for changing the naming from geneModel to synthetic >>> transcript is that “gene model” has different meaning depending on the >>> field of application. >>>>> 3) use easyRNASeq with the modified annotation and the “transcript” >>> count method. >>>>> >>>>> I’ve detailed that procedure in the vignette section 7.1 . >>>>> >>>>> Cheers, >>>>> >>>>> Nico >>>>> >>>>> --------------------------------------------------------------- >>>>> Nicolas Delhomme >>>>> >>>>> Genome Biology Computational Support >>>>> >>>>> European Molecular Biology Laboratory >>>>> >>>>> Tel: +49 6221 387 8310 >>>>> Email: nicolas.delho...@embl.de >>>>> Meyerhofstrasse 1 - Postfach 10.2209 >>>>> 69102 Heidelberg, Germany >>>>> --------------------------------------------------------------- >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> On 12 Nov 2013, at 14:21, Nicolas Delhomme <delho...@embl.de> wrote: >>>>> >>>>>> Hej Natalia! >>>>>> >>>>>> This is not the first time that I’ve seen this error on the list, >>> but I’ve not been able to reproduce it so far with my own data. Would you >>> mind sharing some data offline, just an excerpt of your files would do. If >>> that’s OK, I can create and give access to a folder on my box account. >>>>>> >>>>>> I had already relaxed the constraint on parsing a gtf file in a >>> previous update but forgot to reflect the changes in the error message. >>> Only the gene_id and transcript_id are actually mandatory. I would not >>> expect any issue with the EnsEMBL gtf file, but I’ll have a look at why it >>> fails for Gallus galls one and let you know asap. >>>>>> >>>>>> This as nothing to do with this error, but by looking at your >>> command line, I saw that you provide a character vector to the chrSize >>> argument. This is not necessary as this information is extracted from the >>> bam file in your case, so you can just drop the chrSizes = “chrSizes” from >>> your command line. I’ve added some extra check in the method to detect this >>> now. Thanks :-) >>>>>> >>>>>> Cheers, >>>>>> >>>>>> Nico >>>>>> >>>>>> --------------------------------------------------------------- >>>>>> Nicolas Delhomme >>>>>> >>>>>> Genome Biology Computational Support >>>>>> >>>>>> European Molecular Biology Laboratory >>>>>> >>>>>> Tel: +49 6221 387 8310 >>>>>> Email: nicolas.delho...@embl.de >>>>>> Meyerhofstrasse 1 - Postfach 10.2209 >>>>>> 69102 Heidelberg, Germany >>>>>> --------------------------------------------------------------- >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On 11 Nov 2013, at 16:31, Natalia [guest] <gu...@bioconductor.org> >>> wrote: >>>>>> >>>>>>> >>>>>>> Dear all, >>>>>>> I would like to make a count table to use it in DESeq. I´ve tried >>> to use easyRNAseq but I have a problem with the annotation file. I’ve >>> downloaded the file Gallus_gallus.Galgal4.73.gtf from Ensembl. As I run >>> into the problem Error in .doBasicCount(obj) : The genomicAnnotation slot >>> is empty, I modified the file and added chr before the chromosome number. >>> The next problem was this: >>>>>>> >>>>>>> Your gtf file: Gallus_gallus.Galgal4.73.gtf does not contain all >>> the required fields: gene_id, transcript_id, exon_number, gene_name. >>>>>>> >>>>>>> To solve this problem: >>>>>>> - I deleted all the entries without gene_name (first example): >>>>>>> >>>>>>> gene_id "ENSGALG00000009771"; transcript_id "ENSGALT00000015891"; >>> exon_number "1"; gene_biotype "protein_coding"; exon_id >>> "ENSGALE00000301221"; >>>>>>> >>>>>>> gene_id "ENSGALG00000009783"; transcript_id "ENSGALT00000015914"; >>> exon_number "2"; gene_name "GOLGB1"; gene_biotype "protein_coding"; >>> transcript_name "GOLGB1-201"; exon_id "ENSGALE00000105891"; >>>>>>> >>>>>>> - I checked the chromosome numbers and deleted the entries that >>> didn’t match any chromosome from BSgenome.Ggallus.UCSC.galGal4 (I can’t >>> find any entry corresponding to chr32 in the Gallus_gallus.Galgal4.73.gtf >>> file, I don’t know if it is a problem): >>>>>>> >>>>>>> - I searched for semicolons and single quotes ‘ in the gene names, >>> but I didn’t find any on the final file. >>>>>>> >>>>>>> - I deleted all the columns after gene_name. >>>>>>> >>>>>>> So finally the annotation file entries look like this: >>>>>>> chr1 protein_coding exon 19962541 19963992 >>>>>>> . + . gene_id "ENSGALG00000000003"; >>> transcript_id "ENSGALT00000000003"; exon_number "2"; gene_name "PANX2"; >>>>>>> >>>>>>> Nothing works; the error message is always the same. So, I don’t >>> know what else I can do. Could you please help me? >>>>>>> Thank you in advance! >>>>>>> Cheers >>>>>>> >>>>>>> Natalia >>>>>>> >>>>>>> >>>>>>> here is my code: >>>>>>>> count.table <- easyRNASeq("/RNAseqGallus", organism="Ggallus", >>> chrSizes="chrSizes", annotationMethod="gtf", >>> annotationFile="Gallus_gallus.Galgal4.73.gtf", count="genes", >>> summarization="geneModels", format="bam", gapped=TRUE, >>> filenames=c("NS1gallus.bam","NS2gallus.bam"), outputFormat="DESeq", >>> conditions=conditions) >>>>>>> Checking arguments... >>>>>>> Fetching annotations... >>>>>>> Read 334620 records >>>>>>> Error en .getGtfRange(organismName(obj), filename = filename, >>> ignoreWarnings = ignoreWarnings, : >>>>>>> Your gtf file: Gallus_gallus.Galgal4.73.gtf does not contain all >>> the required fields: gene_id, transcript_id, exon_number, gene_name. >>>>>>> Además: Mensajes de aviso perdidos >>>>>>> 1: In easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes = >>> "chrSizes", : >>>>>>> Your organism has no mapping defined to perform the validity check >>> for the UCSC compliance of the chromosome name. >>>>>>> Defined organism's mapping can be listed using the 'knownOrganisms' >>> function. >>>>>>> To benefit from the validity check, you can provide a 'chr.map' to >>> your 'easyRNASeq' function call. >>>>>>> As you did not do so, 'validity.check' is turned off >>>>>>> 2: In .Method(..., deparse.level = deparse.level) : >>>>>>> number of columns of result is not a multiple of vector length (arg >>> 1) >>>>>>> >>>>>>>> traceback() >>>>>>> 6: stop(paste("Your gtf file: ", filename, " does not contain all >>> the required fields: ", >>>>>>> paste(fields, collapse = ", "), ".", sep = "")) >>>>>>> 5: .getGtfRange(organismName(obj), filename = filename, >>> ignoreWarnings = ignoreWarnings, >>>>>>> ...) >>>>>>> 4: fetchAnnotation(obj, method = annotationMethod, filename = >>> annotationFile, >>>>>>> ignoreWarnings = ignoreWarnings, ...) >>>>>>> 3: fetchAnnotation(obj, method = annotationMethod, filename = >>> annotationFile, >>>>>>> ignoreWarnings = ignoreWarnings, ...) >>>>>>> 2: easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes = >>> "chrSizes", >>>>>>> annotationMethod = "gtf", annotationFile = " >>> Gallus_gallus.Galgal4.73.gtf ", >>>>>>> count = "genes", summarization = "geneModels", format = "bam", >>>>>>> gapped = TRUE, filenames = c("NS1gallus.bam", "NS2gallus.bam"), >>>>>>> outputFormat = "DESeq", conditions = conditions) >>>>>>> 1: easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes = >>> "chrSizes", >>>>>>> annotationMethod = "gtf", annotationFile = " >>> Gallus_gallus.Galgal4.73.gtf ", >>>>>>> count = "genes", summarization = "geneModels", format = "bam", >>>>>>> gapped = TRUE, filenames = c("NS1gallus.bam", "NS2gallus.bam"), >>>>>>> outputFormat = "DESeq", conditions = conditions) >>>>>>> >>>>>>> >>>>>>> -- output of sessionInfo(): >>>>>>> >>>>>>>> sessionInfo() >>>>>>> R version 3.0.2 (2013-09-25) >>>>>>> Platform: x86_64-w64-mingw32/x64 (64-bit) >>>>>>> >>>>>>> locale: >>>>>>> [1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252 >>> LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C >>> LC_TIME=Spanish_Spain.1252 >>>>>>> >>>>>>> attached base packages: >>>>>>> [1] parallel stats graphics grDevices utils datasets >>> methods base >>>>>>> >>>>>>> other attached packages: >>>>>>> [1] BSgenome.Ggallus.UCSC.galGal4_1.3.18 BSgenome_1.30.0 >>> easyRNASeq_1.8.1 ShortRead_1.20.0 >>>>>>> [5] Rsamtools_1.14.1 GenomicRanges_1.14.3 >>> DESeq_1.14.0 lattice_0.20-23 >>>>>>> [9] locfit_1.5-9.1 Biostrings_2.30.0 >>> XVector_0.2.0 IRanges_1.20.4 >>>>>>> [13] edgeR_3.4.0 limma_3.18.2 >>> biomaRt_2.18.0 Biobase_2.22.0 >>>>>>> [17] genomeIntervals_1.18.0 BiocGenerics_0.8.0 >>> intervals_0.14.0 BiocInstaller_1.12.0 >>>>>>> >>>>>>> loaded via a namespace (and not attached): >>>>>>> [1] annotate_1.40.0 AnnotationDbi_1.24.0 bitops_1.0-6 >>> DBI_0.2-7 genefilter_1.44.0 geneplotter_1.40.0 grid_3.0.2 >>>>>>> [8] hwriter_1.3 latticeExtra_0.6-26 LSD_2.5 >>> RColorBrewer_1.0-5 RCurl_1.95-4.1 RSQLite_0.11.4 >>> splines_3.0.2 >>>>>>> [15] stats4_3.0.2 survival_2.37-4 tools_3.0.2 >>> XML_3.98-1.1 xtable_1.7-1 zlibbioc_1.8.0 >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Sent via the guest posting facility at bioconductor.org. >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> bioconduc...@r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> bioconduc...@r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>> >>>> >>> >>> >> >> [[alternative HTML version deleted]] >> >> >> >> _______________________________________________ >> Bioc-devel@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/bioc-devel >> > > > -- > Computational Biology / Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N. > PO Box 19024 Seattle, WA 98109 > > Location: Arnold Building M1 B861 > Phone: (206) 667-2793 _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel