Took that thread to the devel list, just feels more appropriate with regards to the content.
I already have that on my TODO list :-). This is not up-to-date, i.e. I haven’t done the comparison in ~2 years, but last time I did, genomeIntervals attribute parsing was faster than rtracklayer equivalent. I suppose that’s because it is already implemented in C in genomeIntervals. As said I don’t have any actual comparative numbers, still you might want to have a look at the genomeIntervals code. As I don’t think that genomeIntervals get as much exposition as rtracklayer does, many more people would benefit from an equivalent rtracklayer implementation. If you’re interested, I could do a performance comparison - based on my usual use case - between both packages. Nico --------------------------------------------------------------- Nicolas Delhomme Genome Biology Computational Support European Molecular Biology Laboratory Tel: +49 6221 387 8310 Email: nicolas.delho...@embl.de Meyerhofstrasse 1 - Postfach 10.2209 69102 Heidelberg, Germany --------------------------------------------------------------- On 15 Nov 2013, at 18:58, Michael Lawrence <lawrence.mich...@gene.com> wrote: > It might be worth taking a look at rtracklayer and the TranscriptDb stuff in > GenomicFeatures. It could save you time, and if you notice any deficiencies > in rtracklayer, it would help me. For example, if the attribute parsing is a > bottleneck, I can push it down to C. > > Michael > > On Fri, Nov 15, 2013 at 8:23 AM, Nicolas Delhomme <delho...@embl.de> wrote: > Hej Michael, > > Good question really. I have a number of reason for this: > > 1) I’ve been using the genomeIntervals readGff3 function for that - for years > now - and I’ve always been satisfied by its performance, especially when > parsing the gff/gtf ninth column. The parseGffAttribute and getGffAttribute > functions are extremely convenient. I honestly haven’t checked if there was > any recent development in rtracklayer / GenomicFeatures similar to these > functions. If there were not, I think they would be a great addition to > either package. > > 2) As you might guess it’s essentially historical, back when I started that > package in 2009, there was not today’s fantastic set of packages. > > 3) As you painfully know, there’s about as many gff format as they are gff > files, and because my package is a pipeline I really want to make sure that > it’s output is consistent, hence I have strict requirement with regards to > the gff/gtf format I accept. Which means that times and again, I have to do > slight adjustment but I prefer that over outputting garbage. > > 4) RNA-Seq analyses are filled with pitfalls, hence I think it is essential > that users understand the data formats they handle and actually what these > analyses are all about. I don’t want them to use my package as they would use > a black box. > > 5) It’s educational. There’s a vignette that describes how to parse and > convert gff/gtf annotation in the minimal gff/gtf formatted file that would > suit my package > > Well, I suppose it’s more than you asked for, but here are my reasons ;-) > You’re welcome to comment and I’d be happy to look again at rtracklayer (been > through GenomicFeatures recently and I like it much) if you would advise me > so. > > Have a nice WE, > > Cheers, > > Nico > > > --------------------------------------------------------------- > Nicolas Delhomme > > Genome Biology Computational Support > > European Molecular Biology Laboratory > > Tel: +49 6221 387 8310 > Email: nicolas.delho...@embl.de > Meyerhofstrasse 1 - Postfach 10.2209 > 69102 Heidelberg, Germany > --------------------------------------------------------------- > > > > > > On 15 Nov 2013, at 12:44, Michael Lawrence <lawrence.mich...@gene.com> wrote: > > > Why not use rtracklayer / GenomicFeatures for parsing GTF? That format is > > tough; no reason for everyone to take it on by themselves. > > > > > > > > > > On Fri, Nov 15, 2013 at 2:40 AM, Nicolas Delhomme <delho...@embl.de> wrote: > > Hej Natalia! > > > > There were a number of lines in that particular gtf that violated the > > assumptions I had about EnsEMBL gtf. Not all the fields in the attributes' > > column were always set and one of the gene name had a space character in > > it. I’ve made the parsing of gtf file annotation more flexible/lenient and > > that should resolve that particular issue you had. The changes should > > propagate in ~2 days to Bioc with easyRNASeq version 1.8.2. > > > > Rather than using the geneModel, which implementation is old and has gotten > > slow because of changes in the underlying architecture, I prefer an > > approach where I > > 1) filter the gtf / gff annotation file for only those lines I’m interested > > in (e.g. of type exon, mRNA and gene for a gff file) > > 2) collapse every exon of a gene into what I call now a “synthetic > > transcript”. The reason for changing the naming from geneModel to synthetic > > transcript is that “gene model” has different meaning depending on the > > field of application. > > 3) use easyRNASeq with the modified annotation and the “transcript” count > > method. > > > > I’ve detailed that procedure in the vignette section 7.1 . > > > > Cheers, > > > > Nico > > > > --------------------------------------------------------------- > > Nicolas Delhomme > > > > Genome Biology Computational Support > > > > European Molecular Biology Laboratory > > > > Tel: +49 6221 387 8310 > > Email: nicolas.delho...@embl.de > > Meyerhofstrasse 1 - Postfach 10.2209 > > 69102 Heidelberg, Germany > > --------------------------------------------------------------- > > > > > > > > > > > > On 12 Nov 2013, at 14:21, Nicolas Delhomme <delho...@embl.de> wrote: > > > > > Hej Natalia! > > > > > > This is not the first time that I’ve seen this error on the list, but > > > I’ve not been able to reproduce it so far with my own data. Would you > > > mind sharing some data offline, just an excerpt of your files would do. > > > If that’s OK, I can create and give access to a folder on my box account. > > > > > > I had already relaxed the constraint on parsing a gtf file in a previous > > > update but forgot to reflect the changes in the error message. Only the > > > gene_id and transcript_id are actually mandatory. I would not expect any > > > issue with the EnsEMBL gtf file, but I’ll have a look at why it fails for > > > Gallus galls one and let you know asap. > > > > > > This as nothing to do with this error, but by looking at your command > > > line, I saw that you provide a character vector to the chrSize argument. > > > This is not necessary as this information is extracted from the bam file > > > in your case, so you can just drop the chrSizes = “chrSizes” from your > > > command line. I’ve added some extra check in the method to detect this > > > now. Thanks :-) > > > > > > Cheers, > > > > > > Nico > > > > > > --------------------------------------------------------------- > > > Nicolas Delhomme > > > > > > Genome Biology Computational Support > > > > > > European Molecular Biology Laboratory > > > > > > Tel: +49 6221 387 8310 > > > Email: nicolas.delho...@embl.de > > > Meyerhofstrasse 1 - Postfach 10.2209 > > > 69102 Heidelberg, Germany > > > --------------------------------------------------------------- > > > > > > > > > > > > > > > > > > On 11 Nov 2013, at 16:31, Natalia [guest] <gu...@bioconductor.org> wrote: > > > > > >> > > >> Dear all, > > >> I would like to make a count table to use it in DESeq. I´ve tried to use > > >> easyRNAseq but I have a problem with the annotation file. I’ve > > >> downloaded the file Gallus_gallus.Galgal4.73.gtf from Ensembl. As I run > > >> into the problem Error in .doBasicCount(obj) : The genomicAnnotation > > >> slot is empty, I modified the file and added chr before the chromosome > > >> number. The next problem was this: > > >> > > >> Your gtf file: Gallus_gallus.Galgal4.73.gtf does not contain all the > > >> required fields: gene_id, transcript_id, exon_number, gene_name. > > >> > > >> To solve this problem: > > >> - I deleted all the entries without gene_name (first example): > > >> > > >> gene_id "ENSGALG00000009771"; transcript_id "ENSGALT00000015891"; > > >> exon_number "1"; gene_biotype "protein_coding"; exon_id > > >> "ENSGALE00000301221"; > > >> > > >> gene_id "ENSGALG00000009783"; transcript_id "ENSGALT00000015914"; > > >> exon_number "2"; gene_name "GOLGB1"; gene_biotype "protein_coding"; > > >> transcript_name "GOLGB1-201"; exon_id "ENSGALE00000105891"; > > >> > > >> - I checked the chromosome numbers and deleted the entries that didn’t > > >> match any chromosome from BSgenome.Ggallus.UCSC.galGal4 (I can’t find > > >> any entry corresponding to chr32 in the Gallus_gallus.Galgal4.73.gtf > > >> file, I don’t know if it is a problem): > > >> > > >> - I searched for semicolons and single quotes ‘ in the gene names, but I > > >> didn’t find any on the final file. > > >> > > >> - I deleted all the columns after gene_name. > > >> > > >> So finally the annotation file entries look like this: > > >> chr1 protein_coding exon 19962541 19963992 > > >> . + . gene_id "ENSGALG00000000003"; transcript_id > > >> "ENSGALT00000000003"; exon_number "2"; gene_name "PANX2"; > > >> > > >> Nothing works; the error message is always the same. So, I don’t know > > >> what else I can do. Could you please help me? > > >> Thank you in advance! > > >> Cheers > > >> > > >> Natalia > > >> > > >> > > >> here is my code: > > >>> count.table <- easyRNASeq("/RNAseqGallus", organism="Ggallus", > > >>> chrSizes="chrSizes", annotationMethod="gtf", > > >>> annotationFile="Gallus_gallus.Galgal4.73.gtf", count="genes", > > >>> summarization="geneModels", format="bam", gapped=TRUE, > > >>> filenames=c("NS1gallus.bam","NS2gallus.bam"), outputFormat="DESeq", > > >>> conditions=conditions) > > >> Checking arguments... > > >> Fetching annotations... > > >> Read 334620 records > > >> Error en .getGtfRange(organismName(obj), filename = filename, > > >> ignoreWarnings = ignoreWarnings, : > > >> Your gtf file: Gallus_gallus.Galgal4.73.gtf does not contain all the > > >> required fields: gene_id, transcript_id, exon_number, gene_name. > > >> Además: Mensajes de aviso perdidos > > >> 1: In easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes = > > >> "chrSizes", : > > >> Your organism has no mapping defined to perform the validity check for > > >> the UCSC compliance of the chromosome name. > > >> Defined organism's mapping can be listed using the 'knownOrganisms' > > >> function. > > >> To benefit from the validity check, you can provide a 'chr.map' to your > > >> 'easyRNASeq' function call. > > >> As you did not do so, 'validity.check' is turned off > > >> 2: In .Method(..., deparse.level = deparse.level) : > > >> number of columns of result is not a multiple of vector length (arg 1) > > >> > > >>> traceback() > > >> 6: stop(paste("Your gtf file: ", filename, " does not contain all the > > >> required fields: ", > > >> paste(fields, collapse = ", "), ".", sep = "")) > > >> 5: .getGtfRange(organismName(obj), filename = filename, ignoreWarnings = > > >> ignoreWarnings, > > >> ...) > > >> 4: fetchAnnotation(obj, method = annotationMethod, filename = > > >> annotationFile, > > >> ignoreWarnings = ignoreWarnings, ...) > > >> 3: fetchAnnotation(obj, method = annotationMethod, filename = > > >> annotationFile, > > >> ignoreWarnings = ignoreWarnings, ...) > > >> 2: easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes = > > >> "chrSizes", > > >> annotationMethod = "gtf", annotationFile = " > > >> Gallus_gallus.Galgal4.73.gtf ", > > >> count = "genes", summarization = "geneModels", format = "bam", > > >> gapped = TRUE, filenames = c("NS1gallus.bam", "NS2gallus.bam"), > > >> outputFormat = "DESeq", conditions = conditions) > > >> 1: easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes = > > >> "chrSizes", > > >> annotationMethod = "gtf", annotationFile = " > > >> Gallus_gallus.Galgal4.73.gtf ", > > >> count = "genes", summarization = "geneModels", format = "bam", > > >> gapped = TRUE, filenames = c("NS1gallus.bam", "NS2gallus.bam"), > > >> outputFormat = "DESeq", conditions = conditions) > > >> > > >> > > >> -- output of sessionInfo(): > > >> > > >>> sessionInfo() > > >> R version 3.0.2 (2013-09-25) > > >> Platform: x86_64-w64-mingw32/x64 (64-bit) > > >> > > >> locale: > > >> [1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252 > > >> LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C > > >> LC_TIME=Spanish_Spain.1252 > > >> > > >> attached base packages: > > >> [1] parallel stats graphics grDevices utils datasets methods > > >> base > > >> > > >> other attached packages: > > >> [1] BSgenome.Ggallus.UCSC.galGal4_1.3.18 BSgenome_1.30.0 > > >> easyRNASeq_1.8.1 ShortRead_1.20.0 > > >> [5] Rsamtools_1.14.1 GenomicRanges_1.14.3 > > >> DESeq_1.14.0 lattice_0.20-23 > > >> [9] locfit_1.5-9.1 Biostrings_2.30.0 > > >> XVector_0.2.0 IRanges_1.20.4 > > >> [13] edgeR_3.4.0 limma_3.18.2 > > >> biomaRt_2.18.0 Biobase_2.22.0 > > >> [17] genomeIntervals_1.18.0 BiocGenerics_0.8.0 > > >> intervals_0.14.0 BiocInstaller_1.12.0 > > >> > > >> loaded via a namespace (and not attached): > > >> [1] annotate_1.40.0 AnnotationDbi_1.24.0 bitops_1.0-6 > > >> DBI_0.2-7 genefilter_1.44.0 geneplotter_1.40.0 grid_3.0.2 > > >> [8] hwriter_1.3 latticeExtra_0.6-26 LSD_2.5 > > >> RColorBrewer_1.0-5 RCurl_1.95-4.1 RSQLite_0.11.4 > > >> splines_3.0.2 > > >> [15] stats4_3.0.2 survival_2.37-4 tools_3.0.2 > > >> XML_3.98-1.1 xtable_1.7-1 zlibbioc_1.8.0 > > >> > > >> > > >> -- > > >> Sent via the guest posting facility at bioconductor.org. > > > > > > _______________________________________________ > > > Bioconductor mailing list > > > bioconduc...@r-project.org > > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > > Search the archives: > > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > _______________________________________________ > > Bioconductor mailing list > > bioconduc...@r-project.org > > https://stat.ethz.ch/mailman/listinfo/bioconductor > > Search the archives: > > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > _______________________________________________ Bioc-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel