Re: [Bioc-devel] [BioC] GTF file error when using easyRNAseq

Nicolas Delhomme Fri, 15 Nov 2013 10:16:31 -0800

Took that thread to the devel list, just feels more appropriate with regards to 
the content.


I already have that on my TODO list :-). This is not up-to-date, i.e. I haven’t 
done the comparison in ~2 years, but last time I did, genomeIntervals attribute 
parsing was faster than rtracklayer equivalent. I suppose that’s because it is 
already implemented in C in genomeIntervals. As said I don’t have any actual 
comparative numbers, still you might want to have a look at the genomeIntervals 
code. As I don’t think that genomeIntervals get as much exposition as 
rtracklayer does, many more people would benefit from an equivalent rtracklayer 
implementation. If you’re interested, I could do a performance comparison - 
based on my usual use case - between both packages.

Nico

---------------------------------------------------------------
Nicolas Delhomme

Genome Biology Computational Support

European Molecular Biology Laboratory

Tel: +49 6221 387 8310
Email: nicolas.delho...@embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------





On 15 Nov 2013, at 18:58, Michael Lawrence <lawrence.mich...@gene.com> wrote:

> It might be worth taking a look at rtracklayer and the TranscriptDb stuff in 
> GenomicFeatures. It could save you time, and if you notice any deficiencies 
> in rtracklayer, it would help me. For example, if the attribute parsing is a 
> bottleneck, I can push it down to C.
> 
> Michael
> 
> On Fri, Nov 15, 2013 at 8:23 AM, Nicolas Delhomme <delho...@embl.de> wrote:
> Hej Michael,
> 
> Good question really. I have a number of reason for this:
> 
> 1) I’ve been using the genomeIntervals readGff3 function for that - for years 
> now - and I’ve always been satisfied by its performance, especially when 
> parsing the gff/gtf ninth column. The parseGffAttribute and getGffAttribute 
> functions are extremely convenient. I honestly haven’t checked if there was 
> any recent development in rtracklayer / GenomicFeatures similar to these 
> functions. If there were not, I think they would be a great addition to 
> either package.
> 
> 2) As you might guess it’s essentially historical, back when I started that 
> package in 2009, there was not today’s fantastic set of packages.
> 
> 3) As you painfully know, there’s about as many gff format as they are gff 
> files, and because my package is a pipeline I really want to make sure that 
> it’s output is consistent, hence I have strict requirement with regards to 
> the gff/gtf format I accept. Which means that times and again, I have to do 
> slight adjustment but I prefer that over outputting garbage.
> 
> 4) RNA-Seq analyses are filled with pitfalls, hence I think it is essential 
> that users understand the data formats they handle and actually what these 
> analyses are all about. I don’t want them to use my package as they would use 
> a black box.
> 
> 5) It’s educational. There’s a vignette that describes how to parse and 
> convert gff/gtf annotation in the minimal gff/gtf formatted file that would 
> suit my package
> 
> Well, I suppose it’s more than you asked for, but here are my reasons ;-) 
> You’re welcome to comment and I’d be happy to look again at rtracklayer (been 
> through GenomicFeatures recently and I like it much) if you would advise me 
> so.
> 
> Have a nice WE,
> 
> Cheers,
> 
> Nico
> 
> 
> ---------------------------------------------------------------
> Nicolas Delhomme
> 
> Genome Biology Computational Support
> 
> European Molecular Biology Laboratory
> 
> Tel: +49 6221 387 8310
> Email: nicolas.delho...@embl.de
> Meyerhofstrasse 1 - Postfach 10.2209
> 69102 Heidelberg, Germany
> ---------------------------------------------------------------
> 
> 
> 
> 
> 
> On 15 Nov 2013, at 12:44, Michael Lawrence <lawrence.mich...@gene.com> wrote:
> 
> > Why not use rtracklayer / GenomicFeatures for parsing GTF? That format is 
> > tough; no reason for everyone to take it on by themselves.
> >
> >
> >
> >
> > On Fri, Nov 15, 2013 at 2:40 AM, Nicolas Delhomme <delho...@embl.de> wrote:
> > Hej Natalia!
> >
> > There were a number of lines in that particular gtf that violated the 
> > assumptions I had about EnsEMBL gtf. Not all the fields in the attributes' 
> > column were always set and one of the gene name had a space character in 
> > it. I’ve made the parsing of gtf file annotation more flexible/lenient and 
> > that should resolve that particular issue you had. The changes should 
> > propagate in ~2 days to Bioc with easyRNASeq version 1.8.2.
> >
> > Rather than using the geneModel, which implementation is old and has gotten 
> > slow because of changes in the underlying architecture, I prefer an 
> > approach where I
> > 1) filter the gtf / gff annotation file for only those lines I’m interested 
> > in (e.g. of type exon, mRNA and gene for a gff file)
> > 2) collapse every exon of a gene into what I call now a “synthetic 
> > transcript”. The reason for changing the naming from geneModel to synthetic 
> > transcript is that “gene model” has different meaning depending on the 
> > field of application.
> > 3) use easyRNASeq with the modified annotation and the “transcript” count 
> > method.
> >
> > I’ve detailed that procedure in the vignette section 7.1 .
> >
> > Cheers,
> >
> > Nico
> >
> > ---------------------------------------------------------------
> > Nicolas Delhomme
> >
> > Genome Biology Computational Support
> >
> > European Molecular Biology Laboratory
> >
> > Tel: +49 6221 387 8310
> > Email: nicolas.delho...@embl.de
> > Meyerhofstrasse 1 - Postfach 10.2209
> > 69102 Heidelberg, Germany
> > ---------------------------------------------------------------
> >
> >
> >
> >
> >
> > On 12 Nov 2013, at 14:21, Nicolas Delhomme <delho...@embl.de> wrote:
> >
> > > Hej Natalia!
> > >
> > > This is not the first time that I’ve seen this error on the list, but 
> > > I’ve not been able to reproduce it so far with my own data. Would you 
> > > mind sharing some data offline, just an excerpt of your files would do. 
> > > If that’s OK, I can create and give access to a folder on my box account.
> > >
> > > I had already relaxed the constraint on parsing a gtf file in a previous 
> > > update but forgot to reflect the changes in the error message. Only the  
> > > gene_id and transcript_id are actually mandatory. I would not expect any 
> > > issue with the EnsEMBL gtf file, but I’ll have a look at why it fails for 
> > > Gallus galls one and let you know asap.
> > >
> > > This as nothing to do with this error, but by looking at your command 
> > > line, I saw that you provide a character vector to the chrSize argument. 
> > > This is not necessary as this information is extracted from the bam file 
> > > in your case, so you can just drop the chrSizes = “chrSizes” from your 
> > > command line. I’ve added some extra check in the method to detect this 
> > > now. Thanks :-)
> > >
> > > Cheers,
> > >
> > > Nico
> > >
> > > ---------------------------------------------------------------
> > > Nicolas Delhomme
> > >
> > > Genome Biology Computational Support
> > >
> > > European Molecular Biology Laboratory
> > >
> > > Tel: +49 6221 387 8310
> > > Email: nicolas.delho...@embl.de
> > > Meyerhofstrasse 1 - Postfach 10.2209
> > > 69102 Heidelberg, Germany
> > > ---------------------------------------------------------------
> > >
> > >
> > >
> > >
> > >
> > > On 11 Nov 2013, at 16:31, Natalia [guest] <gu...@bioconductor.org> wrote:
> > >
> > >>
> > >> Dear all,
> > >> I would like to make a count table to use it in DESeq. I´ve tried to use 
> > >> easyRNAseq but I have a problem with the annotation file. I’ve 
> > >> downloaded the file Gallus_gallus.Galgal4.73.gtf from Ensembl. As I run 
> > >> into the problem Error in .doBasicCount(obj) : The genomicAnnotation 
> > >> slot is empty, I modified the file and added chr before the chromosome 
> > >> number. The next problem was this:
> > >>
> > >> Your gtf file: Gallus_gallus.Galgal4.73.gtf does not contain all the 
> > >> required fields: gene_id, transcript_id, exon_number, gene_name.
> > >>
> > >> To solve this problem:
> > >> - I deleted all the entries without gene_name (first example):
> > >>
> > >> gene_id "ENSGALG00000009771"; transcript_id "ENSGALT00000015891"; 
> > >> exon_number "1"; gene_biotype "protein_coding"; exon_id 
> > >> "ENSGALE00000301221";
> > >>
> > >> gene_id "ENSGALG00000009783"; transcript_id "ENSGALT00000015914"; 
> > >> exon_number "2"; gene_name "GOLGB1"; gene_biotype "protein_coding"; 
> > >> transcript_name "GOLGB1-201"; exon_id "ENSGALE00000105891";
> > >>
> > >> - I checked the chromosome numbers and deleted the entries that didn’t 
> > >> match any chromosome from BSgenome.Ggallus.UCSC.galGal4 (I can’t find 
> > >> any entry corresponding to chr32 in the Gallus_gallus.Galgal4.73.gtf 
> > >> file, I don’t know if it is a problem):
> > >>
> > >> - I searched for semicolons and single quotes ‘ in the gene names, but I 
> > >> didn’t find any on the final file.
> > >>
> > >> - I deleted all the columns after gene_name.
> > >>
> > >> So finally the annotation file entries look like this:
> > >> chr1 protein_coding  exon    19962541        19963992
> > >>      .       +       .       gene_id "ENSGALG00000000003"; transcript_id 
> > >> "ENSGALT00000000003"; exon_number "2"; gene_name "PANX2";
> > >>
> > >> Nothing works; the error message is always the same. So, I don’t know 
> > >> what else I can do. Could you please help me?
> > >> Thank you in advance!
> > >> Cheers
> > >>
> > >> Natalia
> > >>
> > >>
> > >> here is my code:
> > >>> count.table <- easyRNASeq("/RNAseqGallus", organism="Ggallus", 
> > >>> chrSizes="chrSizes", annotationMethod="gtf", 
> > >>> annotationFile="Gallus_gallus.Galgal4.73.gtf", count="genes", 
> > >>> summarization="geneModels", format="bam", gapped=TRUE, 
> > >>> filenames=c("NS1gallus.bam","NS2gallus.bam"), outputFormat="DESeq", 
> > >>> conditions=conditions)
> > >> Checking arguments...
> > >> Fetching annotations...
> > >> Read 334620 records
> > >> Error en .getGtfRange(organismName(obj), filename = filename, 
> > >> ignoreWarnings = ignoreWarnings,  :
> > >> Your gtf file: Gallus_gallus.Galgal4.73.gtf does not contain all the 
> > >> required fields: gene_id, transcript_id, exon_number, gene_name.
> > >> Además: Mensajes de aviso perdidos
> > >> 1: In easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes = 
> > >> "chrSizes",  :
> > >> Your organism has no mapping defined to perform the validity check for 
> > >> the UCSC compliance of the chromosome name.
> > >> Defined organism's mapping can be listed using the 'knownOrganisms' 
> > >> function.
> > >> To benefit from the validity check, you can provide a 'chr.map' to your 
> > >> 'easyRNASeq' function call.
> > >> As you did not do so, 'validity.check' is turned off
> > >> 2: In .Method(..., deparse.level = deparse.level) :
> > >> number of columns of result is not a multiple of vector length (arg 1)
> > >>
> > >>> traceback()
> > >> 6: stop(paste("Your gtf file: ", filename, " does not contain all the 
> > >> required fields: ",
> > >>    paste(fields, collapse = ", "), ".", sep = ""))
> > >> 5: .getGtfRange(organismName(obj), filename = filename, ignoreWarnings = 
> > >> ignoreWarnings,
> > >>    ...)
> > >> 4: fetchAnnotation(obj, method = annotationMethod, filename = 
> > >> annotationFile,
> > >>    ignoreWarnings = ignoreWarnings, ...)
> > >> 3: fetchAnnotation(obj, method = annotationMethod, filename = 
> > >> annotationFile,
> > >>    ignoreWarnings = ignoreWarnings, ...)
> > >> 2: easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes = 
> > >> "chrSizes",
> > >>    annotationMethod = "gtf", annotationFile = " 
> > >> Gallus_gallus.Galgal4.73.gtf ",
> > >>    count = "genes", summarization = "geneModels", format = "bam",
> > >>    gapped = TRUE, filenames = c("NS1gallus.bam", "NS2gallus.bam"),
> > >>    outputFormat = "DESeq", conditions = conditions)
> > >> 1: easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes = 
> > >> "chrSizes",
> > >>    annotationMethod = "gtf", annotationFile = " 
> > >> Gallus_gallus.Galgal4.73.gtf ",
> > >>    count = "genes", summarization = "geneModels", format = "bam",
> > >>    gapped = TRUE, filenames = c("NS1gallus.bam", "NS2gallus.bam"),
> > >>    outputFormat = "DESeq", conditions = conditions)
> > >>
> > >>
> > >> -- output of sessionInfo():
> > >>
> > >>> sessionInfo()
> > >> R version 3.0.2 (2013-09-25)
> > >> Platform: x86_64-w64-mingw32/x64 (64-bit)
> > >>
> > >> locale:
> > >> [1] LC_COLLATE=Spanish_Spain.1252  LC_CTYPE=Spanish_Spain.1252    
> > >> LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C                   
> > >> LC_TIME=Spanish_Spain.1252
> > >>
> > >> attached base packages:
> > >> [1] parallel  stats     graphics  grDevices utils     datasets  methods  
> > >>  base
> > >>
> > >> other attached packages:
> > >> [1] BSgenome.Ggallus.UCSC.galGal4_1.3.18 BSgenome_1.30.0                 
> > >>      easyRNASeq_1.8.1                     ShortRead_1.20.0
> > >> [5] Rsamtools_1.14.1                     GenomicRanges_1.14.3            
> > >>      DESeq_1.14.0                         lattice_0.20-23
> > >> [9] locfit_1.5-9.1                       Biostrings_2.30.0               
> > >>      XVector_0.2.0                        IRanges_1.20.4
> > >> [13] edgeR_3.4.0                          limma_3.18.2                   
> > >>       biomaRt_2.18.0                       Biobase_2.22.0
> > >> [17] genomeIntervals_1.18.0               BiocGenerics_0.8.0             
> > >>       intervals_0.14.0                     BiocInstaller_1.12.0
> > >>
> > >> loaded via a namespace (and not attached):
> > >> [1] annotate_1.40.0      AnnotationDbi_1.24.0 bitops_1.0-6         
> > >> DBI_0.2-7            genefilter_1.44.0    geneplotter_1.40.0   grid_3.0.2
> > >> [8] hwriter_1.3          latticeExtra_0.6-26  LSD_2.5              
> > >> RColorBrewer_1.0-5   RCurl_1.95-4.1       RSQLite_0.11.4       
> > >> splines_3.0.2
> > >> [15] stats4_3.0.2         survival_2.37-4      tools_3.0.2          
> > >> XML_3.98-1.1         xtable_1.7-1         zlibbioc_1.8.0
> > >>
> > >>
> > >> --
> > >> Sent via the guest posting facility at bioconductor.org.
> > >
> > > _______________________________________________
> > > Bioconductor mailing list
> > > bioconduc...@r-project.org
> > > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > > Search the archives: 
> > > http://news.gmane.org/gmane.science.biology.informatics.conductor
> >
> > _______________________________________________
> > Bioconductor mailing list
> > bioconduc...@r-project.org
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives: 
> > http://news.gmane.org/gmane.science.biology.informatics.conductor
> >
> 
> 

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] [BioC] GTF file error when using easyRNAseq

Reply via email to