Re: [Bioc-devel] [BioC] GTF file error when using easyRNAseq

Martin Morgan Fri, 15 Nov 2013 11:14:35 -0800

On 11/15/2013 10:22 AM, Michael Lawrence wrote:

Doesn't look like genomeIntervals has any C code (?), so a performance
comparison would be interesting. rtracklayer jumps through all sorts of
hoops to handle obscure things like URL encoding in GFF3. The code in
genomeIntervals seems more streamlined.

Wanted to mention, and it would be good to know if this was not helpful at all,that the Ensembl gtf files are available through AnnotationHub as GRanges objects


> library(AnnotationHub)
> hub = AnnotationHub()
> hub$ensembl.release.73.<tab>
hub$ensembl.release.73.fasta. ... [378]
hub$ensembl.release.73.gtf. ... [63]

> xx =hub$ensembl.release.73.gtf.gallus_gallus.Gallus_gallus.Galgal4.73.gtf_0.0.1.RData

> xx
GRanges with 381368 ranges and 12 metadata columns:
                 seqnames       ranges strand   |         source        type
                    <Rle>    <IRanges>  <Rle>   |       <factor>    <factor>
       [1]              1 [1735, 2449]      +   | protein_coding        exon
       [2]              1 [2379, 2449]      +   | protein_coding         CDS
               score     phase            gene_id      transcript_id
           <numeric> <integer>        <character>        <character>
       [1]      <NA>      <NA> ENSGALG00000009771 ENSGALT00000015891
       [2]      <NA>         0 ENSGALG00000009771 ENSGALT00000015891
           exon_number   gene_biotype            exon_id         protein_id
             <numeric>    <character>        <character>        <character>
       [1]           1 protein_coding ENSGALE00000301221               <NA>
       [2]           1 protein_coding               <NA> ENSGALP00000015874
                gene_name    transcript_name
              <character>        <character>
       [1]           <NA>               <NA>
       [2]           <NA>               <NA>
 [ reached getOption("max.print") -- omitted 9 rows ]
  ---
  seqlengths:
                    1                  2 ...     AADN03010940.1
                   NA                 NA ...                 NA

Martin







On Fri, Nov 15, 2013 at 10:14 AM, Nicolas Delhomme <delho...@embl.de> wrote:

Took that thread to the devel list, just feels more appropriate with
regards to the content.

I already have that on my TODO list :-). This is not up-to-date, i.e. I
haven’t done the comparison in ~2 years, but last time I did,
genomeIntervals attribute parsing was faster than rtracklayer equivalent. I
suppose that’s because it is already implemented in C in genomeIntervals.
As said I don’t have any actual comparative numbers, still you might want
to have a look at the genomeIntervals code. As I don’t think that
genomeIntervals get as much exposition as rtracklayer does, many more
people would benefit from an equivalent rtracklayer implementation. If
you’re interested, I could do a performance comparison - based on my usual
use case - between both packages.

Nico

---------------------------------------------------------------
Nicolas Delhomme

Genome Biology Computational Support

European Molecular Biology Laboratory

Tel: +49 6221 387 8310
Email: nicolas.delho...@embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------





On 15 Nov 2013, at 18:58, Michael Lawrence <lawrence.mich...@gene.com>
wrote:

It might be worth taking a look at rtracklayer and the TranscriptDb

stuff in GenomicFeatures. It could save you time, and if you notice any
deficiencies in rtracklayer, it would help me. For example, if the
attribute parsing is a bottleneck, I can push it down to C.


Michael

On Fri, Nov 15, 2013 at 8:23 AM, Nicolas Delhomme <delho...@embl.de>

wrote:

Hej Michael,

Good question really. I have a number of reason for this:

1) I’ve been using the genomeIntervals readGff3 function for that - for

years now - and I’ve always been satisfied by its performance, especially
when parsing the gff/gtf ninth column. The parseGffAttribute and
getGffAttribute functions are extremely convenient. I honestly haven’t
checked if there was any recent development in rtracklayer /
GenomicFeatures similar to these functions. If there were not, I think they
would be a great addition to either package.


2) As you might guess it’s essentially historical, back when I started

that package in 2009, there was not today’s fantastic set of packages.


3) As you painfully know, there’s about as many gff format as they are

gff files, and because my package is a pipeline I really want to make sure
that it’s output is consistent, hence I have strict requirement with
regards to the gff/gtf format I accept. Which means that times and again, I
have to do slight adjustment but I prefer that over outputting garbage.


4) RNA-Seq analyses are filled with pitfalls, hence I think it is

essential that users understand the data formats they handle and actually
what these analyses are all about. I don’t want them to use my package as
they would use a black box.


5) It’s educational. There’s a vignette that describes how to parse and

convert gff/gtf annotation in the minimal gff/gtf formatted file that would
suit my package


Well, I suppose it’s more than you asked for, but here are my reasons

;-) You’re welcome to comment and I’d be happy to look again at rtracklayer
(been through GenomicFeatures recently and I like it much) if you would
advise me so.


Have a nice WE,

Cheers,

Nico


---------------------------------------------------------------
Nicolas Delhomme

Genome Biology Computational Support

European Molecular Biology Laboratory

Tel: +49 6221 387 8310
Email: nicolas.delho...@embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------





On 15 Nov 2013, at 12:44, Michael Lawrence <lawrence.mich...@gene.com>

wrote:

Why not use rtracklayer / GenomicFeatures for parsing GTF? That format

is tough; no reason for everyone to take it on by themselves.





On Fri, Nov 15, 2013 at 2:40 AM, Nicolas Delhomme <delho...@embl.de>

wrote:

Hej Natalia!

There were a number of lines in that particular gtf that violated the

assumptions I had about EnsEMBL gtf. Not all the fields in the attributes'
column were always set and one of the gene name had a space character in
it. I’ve made the parsing of gtf file annotation more flexible/lenient and
that should resolve that particular issue you had. The changes should
propagate in ~2 days to Bioc with easyRNASeq version 1.8.2.


Rather than using the geneModel, which implementation is old and has

gotten slow because of changes in the underlying architecture, I prefer an
approach where I

1) filter the gtf / gff annotation file for only those lines I’m

interested in (e.g. of type exon, mRNA and gene for a gff file)

2) collapse every exon of a gene into what I call now a “synthetic

transcript”. The reason for changing the naming from geneModel to synthetic
transcript is that “gene model” has different meaning depending on the
field of application.

3) use easyRNASeq with the modified annotation and the “transcript”

count method.


I’ve detailed that procedure in the vignette section 7.1 .

Cheers,

Nico

---------------------------------------------------------------
Nicolas Delhomme

Genome Biology Computational Support

European Molecular Biology Laboratory

Tel: +49 6221 387 8310
Email: nicolas.delho...@embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------





On 12 Nov 2013, at 14:21, Nicolas Delhomme <delho...@embl.de> wrote:

Hej Natalia!

This is not the first time that I’ve seen this error on the list,

but I’ve not been able to reproduce it so far with my own data. Would you
mind sharing some data offline, just an excerpt of your files would do. If
that’s OK, I can create and give access to a folder on my box account.


I had already relaxed the constraint on parsing a gtf file in a

previous update but forgot to reflect the changes in the error message.
Only the  gene_id and transcript_id are actually mandatory. I would not
expect any issue with the EnsEMBL gtf file, but I’ll have a look at why it
fails for Gallus galls one and let you know asap.


This as nothing to do with this error, but by looking at your

command line, I saw that you provide a character vector to the chrSize
argument. This is not necessary as this information is extracted from the
bam file in your case, so you can just drop the chrSizes = “chrSizes” from
your command line. I’ve added some extra check in the method to detect this
now. Thanks :-)


Cheers,

Nico

---------------------------------------------------------------
Nicolas Delhomme

Genome Biology Computational Support

European Molecular Biology Laboratory

Tel: +49 6221 387 8310
Email: nicolas.delho...@embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------





On 11 Nov 2013, at 16:31, Natalia [guest] <gu...@bioconductor.org>

wrote:


Dear all,
I would like to make a count table to use it in DESeq. I´ve tried

to use easyRNAseq but I have a problem with the annotation file. I’ve
downloaded the file Gallus_gallus.Galgal4.73.gtf from Ensembl. As I run
into the problem Error in .doBasicCount(obj) : The genomicAnnotation slot
is empty, I modified the file and added chr before the chromosome number.
The next problem was this:


Your gtf file: Gallus_gallus.Galgal4.73.gtf does not contain all

the required fields: gene_id, transcript_id, exon_number, gene_name.


To solve this problem:
- I deleted all the entries without gene_name (first example):

gene_id "ENSGALG00000009771"; transcript_id "ENSGALT00000015891";

exon_number "1"; gene_biotype "protein_coding"; exon_id
"ENSGALE00000301221";


gene_id "ENSGALG00000009783"; transcript_id "ENSGALT00000015914";

exon_number "2"; gene_name "GOLGB1"; gene_biotype "protein_coding";
transcript_name "GOLGB1-201"; exon_id "ENSGALE00000105891";


- I checked the chromosome numbers and deleted the entries that

didn’t match any chromosome from BSgenome.Ggallus.UCSC.galGal4 (I can’t
find any entry corresponding to chr32 in the Gallus_gallus.Galgal4.73.gtf
file, I don’t know if it is a problem):


- I searched for semicolons and single quotes ‘ in the gene names,

but I didn’t find any on the final file.


- I deleted all the columns after gene_name.

So finally the annotation file entries look like this:
chr1 protein_coding  exon    19962541        19963992
      .       +       .       gene_id "ENSGALG00000000003";

transcript_id "ENSGALT00000000003"; exon_number "2"; gene_name "PANX2";


Nothing works; the error message is always the same. So, I don’t

know what else I can do. Could you please help me?

Thank you in advance!
Cheers

Natalia


here is my code:

count.table <- easyRNASeq("/RNAseqGallus", organism="Ggallus",

chrSizes="chrSizes", annotationMethod="gtf",
annotationFile="Gallus_gallus.Galgal4.73.gtf", count="genes",
summarization="geneModels", format="bam", gapped=TRUE,
filenames=c("NS1gallus.bam","NS2gallus.bam"), outputFormat="DESeq",
conditions=conditions)

Checking arguments...
Fetching annotations...
Read 334620 records
Error en .getGtfRange(organismName(obj), filename = filename,

ignoreWarnings = ignoreWarnings,  :

Your gtf file: Gallus_gallus.Galgal4.73.gtf does not contain all

the required fields: gene_id, transcript_id, exon_number, gene_name.

Además: Mensajes de aviso perdidos
1: In easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes =

"chrSizes",  :

Your organism has no mapping defined to perform the validity check

for the UCSC compliance of the chromosome name.

Defined organism's mapping can be listed using the 'knownOrganisms'

function.

To benefit from the validity check, you can provide a 'chr.map' to

your 'easyRNASeq' function call.

As you did not do so, 'validity.check' is turned off
2: In .Method(..., deparse.level = deparse.level) :
number of columns of result is not a multiple of vector length (arg

1)

traceback()

6: stop(paste("Your gtf file: ", filename, " does not contain all

the required fields: ",

    paste(fields, collapse = ", "), ".", sep = ""))
5: .getGtfRange(organismName(obj), filename = filename,

ignoreWarnings = ignoreWarnings,

    ...)
4: fetchAnnotation(obj, method = annotationMethod, filename =

annotationFile,

    ignoreWarnings = ignoreWarnings, ...)
3: fetchAnnotation(obj, method = annotationMethod, filename =

annotationFile,

    ignoreWarnings = ignoreWarnings, ...)
2: easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes =

"chrSizes",

    annotationMethod = "gtf", annotationFile = "

Gallus_gallus.Galgal4.73.gtf ",

    count = "genes", summarization = "geneModels", format = "bam",
    gapped = TRUE, filenames = c("NS1gallus.bam", "NS2gallus.bam"),
    outputFormat = "DESeq", conditions = conditions)
1: easyRNASeq("/RNAseqGallus", organism = "Ggallus", chrSizes =

"chrSizes",

    annotationMethod = "gtf", annotationFile = "

Gallus_gallus.Galgal4.73.gtf ",

    count = "genes", summarization = "geneModels", format = "bam",
    gapped = TRUE, filenames = c("NS1gallus.bam", "NS2gallus.bam"),
    outputFormat = "DESeq", conditions = conditions)


-- output of sessionInfo():

sessionInfo()

R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Spanish_Spain.1252  LC_CTYPE=Spanish_Spain.1252

  LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C
LC_TIME=Spanish_Spain.1252


attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets

  methods   base


other attached packages:
[1] BSgenome.Ggallus.UCSC.galGal4_1.3.18 BSgenome_1.30.0

            easyRNASeq_1.8.1                     ShortRead_1.20.0

[5] Rsamtools_1.14.1                     GenomicRanges_1.14.3

           DESeq_1.14.0                         lattice_0.20-23

[9] locfit_1.5-9.1                       Biostrings_2.30.0

            XVector_0.2.0                        IRanges_1.20.4

[13] edgeR_3.4.0                          limma_3.18.2

             biomaRt_2.18.0                       Biobase_2.22.0

[17] genomeIntervals_1.18.0               BiocGenerics_0.8.0

             intervals_0.14.0                     BiocInstaller_1.12.0


loaded via a namespace (and not attached):
[1] annotate_1.40.0      AnnotationDbi_1.24.0 bitops_1.0-6

DBI_0.2-7            genefilter_1.44.0    geneplotter_1.40.0   grid_3.0.2

[8] hwriter_1.3          latticeExtra_0.6-26  LSD_2.5

  RColorBrewer_1.0-5   RCurl_1.95-4.1       RSQLite_0.11.4
splines_3.0.2

[15] stats4_3.0.2         survival_2.37-4      tools_3.0.2

  XML_3.98-1.1         xtable_1.7-1         zlibbioc_1.8.0



--
Sent via the guest posting facility at bioconductor.org.


_______________________________________________
Bioconductor mailing list
bioconduc...@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives:

http://news.gmane.org/gmane.science.biology.informatics.conductor


_______________________________________________
Bioconductor mailing list
bioconduc...@r-project.org
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives:

http://news.gmane.org/gmane.science.biology.informatics.conductor


        [[alternative HTML version deleted]]



_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel



--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] [BioC] GTF file error when using easyRNAseq

Reply via email to