Re: [Bioc-devel] [BioC] GTF file error when using easyRNAseq

2013-11-15 Thread Nicolas Delhomme
I took this thread to the devel list; that feels more appropriate given the 
content.

I already have that on my TODO list :-). My information is not up to date, i.e. 
I haven’t done the comparison in ~2 years, but the last time I did, 
genomeIntervals attribute parsing was faster than the rtracklayer equivalent. I 
suppose that’s because it is already implemented in C in genomeIntervals. As I 
said, I don’t have any actual comparative numbers; still, you might want to have 
a look at the genomeIntervals code. As I don’t think that genomeIntervals gets 
as much exposure as rtracklayer does, many more people would benefit from an 
equivalent rtracklayer implementation. If you’re interested, I could do a 
performance comparison, based on my usual use case, between both packages.

Nico

---
Nicolas Delhomme

Genome Biology Computational Support

European Molecular Biology Laboratory

Tel: +49 6221 387 8310
Email: nicolas.delho...@embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---





On 15 Nov 2013, at 18:58, Michael Lawrence lawrence.mich...@gene.com wrote:

 It might be worth taking a look at rtracklayer and the TranscriptDb stuff in 
 GenomicFeatures. It could save you time, and if you notice any deficiencies 
 in rtracklayer, it would help me. For example, if the attribute parsing is a 
 bottleneck, I can push it down to C.
 
 Michael
 
 On Fri, Nov 15, 2013 at 8:23 AM, Nicolas Delhomme delho...@embl.de wrote:
 Hej Michael,
 
 Good question really. I have a number of reasons for this:
 
 1) I’ve been using the genomeIntervals readGff3 function for that, for years 
 now, and I’ve always been satisfied with its performance, especially when 
 parsing the gff/gtf ninth column. The parseGffAttributes and getGffAttribute 
 functions are extremely convenient. I honestly haven’t checked whether there 
 has been any recent development in rtracklayer / GenomicFeatures similar to 
 these functions. If there has not, I think they would be a great addition to 
 either package.
 
 2) As you might guess, it’s essentially historical: back when I started that 
 package in 2009, today’s fantastic set of packages did not exist.
 
 3) As you painfully know, there are about as many gff formats as there are gff 
 files, and because my package is a pipeline I really want to make sure that 
 its output is consistent, hence I have strict requirements with regards to 
 the gff/gtf format I accept. This means that, time and again, I have to make 
 slight adjustments, but I prefer that over outputting garbage.
 
 4) RNA-Seq analyses are filled with pitfalls, hence I think it is essential 
 that users understand the data formats they handle and actually what these 
 analyses are all about. I don’t want them to use my package as they would use 
 a black box.
 
 5) It’s educational. There’s a vignette that describes how to parse and 
 convert a gff/gtf annotation into the minimal gff/gtf formatted file that 
 would suit my package.
 
 Well, I suppose it’s more than you asked for, but here are my reasons ;-) 
 You’re welcome to comment, and I’d be happy to look again at rtracklayer (I’ve 
 been through GenomicFeatures recently and I like it a lot) if you advise me 
 to.
 
 Have a nice WE,
 
 Cheers,
 
 Nico
 
 
 
 
 
 
 
 On 15 Nov 2013, at 12:44, Michael Lawrence lawrence.mich...@gene.com wrote:
 
  Why not use rtracklayer / GenomicFeatures for parsing GTF? That format is 
  tough; no reason for everyone to take it on by themselves.
 
 
 
 
  On Fri, Nov 15, 2013 at 2:40 AM, Nicolas Delhomme delho...@embl.de wrote:
  Hej Natalia!
 
  There were a number of lines in that particular gtf that violated the 
  assumptions I had about EnsEMBL gtf files. Not all the fields in the 
  attributes column were always set, and one of the gene names had a space 
  character in it. I’ve made the parsing of gtf annotation files more 
  flexible/lenient, and that should resolve the particular issue you had. The 
  changes should propagate in ~2 days to Bioc with easyRNASeq version 1.8.2.
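
To illustrate the kind of lenient attribute parsing described above, here is a minimal base-R sketch. It is not the easyRNASeq or genomeIntervals code: the function names parse_gtf_attributes and get_gtf_attribute and the sample records are made up for illustration, and it assumes every value is double-quoted, as is usual in Ensembl gtf files.

```r
# Hypothetical helper (not part of easyRNASeq): parse one GTF attribute
# string into a named character vector. Values may contain spaces.
# Assumption: every value is wrapped in double quotes; unquoted values
# are silently skipped.
parse_gtf_attributes <- function(attr_string) {
  # Match `key "value"` pairs; the value may hold anything but a quote.
  pattern <- '([A-Za-z_][A-Za-z0-9_]*) "([^"]*)"'
  hits <- regmatches(attr_string, gregexpr(pattern, attr_string))[[1]]
  if (length(hits) == 0L) return(character(0))
  keys <- sub(pattern, "\\1", hits)
  values <- sub(pattern, "\\2", hits)
  setNames(values, keys)
}

# Hypothetical helper: fetch one field from many attribute strings,
# returning NA for records that do not set it.
get_gtf_attribute <- function(attr_strings, field) {
  vapply(attr_strings, function(s) {
    parsed <- parse_gtf_attributes(s)
    if (field %in% names(parsed)) parsed[[field]] else NA_character_
  }, character(1), USE.NAMES = FALSE)
}

# Made-up records: the second lacks gene_name, the third has a space in it.
records <- c(
  'gene_id "G1"; transcript_id "T1"; gene_name "abc";',
  'gene_id "G2"; transcript_id "T2";',
  'gene_id "G3"; transcript_id "T3"; gene_name "abc def";'
)
gene_names <- get_gtf_attribute(records, "gene_name")
```

Matching key/value pairs with a pattern, rather than splitting on spaces or assuming a fixed field order, is what lets a missing field or a space inside a value pass through without an error.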
 
  Rather than using the geneModel, whose implementation is old and has gotten 
  slow because of changes in the underlying architecture, I prefer an 
  approach where I
  1) filter the gtf / gff annotation file for only those lines I’m interested 
  in (e.g. of type exon, mRNA and gene for a gff file)
  2) collapse every exon of a gene into what I now call a “synthetic 
  transcript”. The reason for changing the naming from geneModel to synthetic 
  transcript is that “gene model” has different meanings depending on the 
 

Re: [Bioc-devel] [BioC] GTF file error when using easyRNAseq

2013-11-15 Thread Michael Lawrence
Doesn't look like genomeIntervals has any C code (?), so a performance
comparison would be interesting. rtracklayer jumps through all sorts of
hoops to handle obscure things like URL encoding in GFF3. The code in
genomeIntervals seems more streamlined.
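
As an illustration of the URL-encoding point, here is a minimal base-R sketch of decoding a GFF3 attribute entry. It is not the rtracklayer implementation; the function name parse_gff3_attributes and the example string are made up, and the sketch assumes every field has the form tag=value.

```r
# Hypothetical helper (not the rtracklayer code): parse one GFF3 attribute
# entry (tag=value pairs separated by ";") and percent-decode the values.
# Assumption: every field contains an "=".
parse_gff3_attributes <- function(attr_string) {
  # Split on ";" BEFORE decoding, so that an encoded ";" (%3B) inside a
  # value is not mistaken for a field separator.
  fields <- strsplit(attr_string, ";", fixed = TRUE)[[1]]
  fields <- fields[nzchar(fields)]
  # Split each field at its first "=" only.
  eq <- regexpr("=", fields, fixed = TRUE)
  tags <- substr(fields, 1L, eq - 1L)
  raw_values <- substr(fields, eq + 1L, nchar(fields))
  # Decode one value at a time.
  values <- vapply(raw_values, utils::URLdecode, character(1),
                   USE.NAMES = FALSE)
  setNames(values, tags)
}

# Made-up example: the Note value contains an encoded ";" and ",".
attrs <- parse_gff3_attributes("ID=gene1;Note=kinase%3B putative%2C partial")
```

Here attrs[["Note"]] comes back as "kinase; putative, partial". The extra decoding pass over every value is one plausible reason a fully compliant parser does more work than a streamlined one.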







Re: [Bioc-devel] [BioC] GTF file error when using easyRNAseq

2013-11-15 Thread Martin Morgan

On 11/15/2013 10:22 AM, Michael Lawrence wrote:

Doesn't look like genomeIntervals has any C code (?), so a performance
comparison would be interesting. rtracklayer jumps through all sorts of
hoops to handle obscure things like URL encoding in GFF3. The code in
genomeIntervals seems more streamlined.


I wanted to mention (and it would be good to know if this is not helpful at 
all) that the Ensembl gtf files are available through AnnotationHub as GRanges 
objects:


> library(AnnotationHub)
> hub = AnnotationHub()
> hub$ensembl.release.73.<tab>
hub$ensembl.release.73.fasta. ... [378]
hub$ensembl.release.73.gtf. ... [63]
> xx = hub$ensembl.release.73.gtf.gallus_gallus.Gallus_gallus.Galgal4.73.gtf_0.0.1.RData
> xx
GRanges with 381368 ranges and 12 metadata columns:
      seqnames       ranges strand   |         source     type
         <Rle>    <IRanges>  <Rle>   |       <factor> <factor>
  [1]        1 [1735, 2449]      +   | protein_coding     exon
  [2]        1 [2379, 2449]      +   | protein_coding      CDS
          score     phase        gene_id  transcript_id
      <numeric> <integer>    <character>    <character>
  [1]      <NA>      <NA> ENSGALG0009771 ENSGALT0015891
  [2]      <NA>         0 ENSGALG0009771 ENSGALT0015891
      exon_number   gene_biotype        exon_id     protein_id
        <numeric>    <character>    <character>    <character>
  [1]           1 protein_coding ENSGALE0301221           <NA>
  [2]           1 protein_coding           <NA> ENSGALP0015874
        gene_name transcript_name
      <character>     <character>
  [1]        <NA>            <NA>
  [2]        <NA>            <NA>
  [ reached getOption("max.print") -- omitted 9 rows ]
  ---
  seqlengths:
                1              2 ... AADN03010940.1
               NA             NA ...             NA

Martin







