Re: [Bioc-devel] Gene annotation: TxDb vs ENSEMBL/NCBI inconsistency

Rainer Johannes Wed, 10 Jun 2015 01:49:38 -0700

dear Ludwig,

On 10 Jun 2015, at 10:29, Ludwig Geistlinger
<ludwig.geistlin...@bio.ifi.lmu.de> wrote:


Dear Johannes,

one follow-up question/comment on the EnsDb packages:

The reason they escaped my notice (and may well escape that of others)
is that I expected such packages to be named "^TxDb...".

What actually argues against sticking to the existing Bioc vocabulary and
naming e.g. EnsDb.Hsapiens.v79 as

TxDb.Hsapiens.Ensembl.hg38.ensGene ?


The reason is that I defined the EnsDb class, which differs from the TxDb
class: I added some Ensembl-specific information, such as the gene biotype,
that is not covered by TxDb. I tried to implement the same functionality as
for TxDb classes so that EnsDb can be integrated seamlessly into existing
TxDb-based workflows, just using the Ensembl annotations.
The naming convention for such packages is always <class
name>.<organism>.<version>, so it was suggested that I use the naming
convention I am using at present.
For the version I opted for the Ensembl version rather than the genome
version, because there are several Ensembl versions for the same genome
release and the annotations can change, sometimes considerably (or at least
did in the past), from one Ensembl version to the next. This naming also
makes it possible to have annotation packages from different Ensembl
releases installed (or used in the same R session) and to compare gene
models between them.
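
To illustrate the naming scheme described above, here is a minimal base R
sketch. The helper name parse_ensdb_name is hypothetical and not part of
ensembldb; it only splits a package name of the form
<class name>.<organism>.<version> into its parts.

```r
# Hypothetical helper (not part of ensembldb): split an annotation package
# name of the form <class name>.<organism>.<version> into its components.
parse_ensdb_name <- function(pkg) {
  parts <- strsplit(pkg, ".", fixed = TRUE)[[1]]
  stopifnot(length(parts) == 3)
  list(
    class = parts[1],
    organism = parts[2],
    # The version part is the Ensembl release, e.g. "v79" -> 79
    ensembl_version = as.integer(sub("^v", "", parts[3]))
  )
}

parsed <- parse_ensdb_name("EnsDb.Hsapiens.v79")
str(parsed)
```

Note that the genome build is not encoded in the name; only the Ensembl
release is.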

(or alternatively, if packages like BSgenome.Hsapiens.NCBI.GRCh38 will
indeed make it in the long run:  TxDb.Hsapiens.Ensembl.GRCh38.ensGene)

This would also have the advantage that genome build and idType could be
inferred right from the package name.


I agree that this would ease the mapping between BSgenome and EnsDb but, as
explained above, I would like to stick to the Ensembl version instead. I
really like Martin's AnnotationHub approach for getting the appropriate DNA
sequence for an Ensembl release, so once I have figured out what caused the
error in the previous mail, I'll implement a method that returns the correct
DNA sequence object for a given EnsDb package.

cheers, jo

Best,
Ludwig


dear Robert and Ludwig,

the EnsDb packages provide all the gene/transcript etc. annotations for all
genes defined in the Ensembl database (for a given species and Ensembl
release). Except for the column/attribute "entrezid" that is stored in the
internal database, there is however no link to NCBI or UCSC annotations.
So, basically, if you want to use "pure" Ensembl-based annotations, use
EnsDb; if you want the UCSC annotations, use the TxDb packages.

In case you need EnsDbs of other species or Ensembl versions, the
ensembldb package provides functionality to generate such packages either
using the Ensembl Perl API or using GTF files provided by Ensembl. If you
have problems building the packages, just drop me a line and I'll do
that.
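
For readers who want to try this, the following is a rough sketch of how the
BRCA1 gene from this thread might be looked up in an EnsDb package. It
assumes that EnsDb.Hsapiens.v79 is installed and that the installed ensembldb
provides GeneIdFilter (via AnnotationFilter); the filter class names have
changed between ensembldb releases, so treat the call as an assumption and
check your local documentation. The wrapper name brca1_from_ensdb is
hypothetical.

```r
# Sketch only: returns NULL when the annotation package is not installed.
brca1_from_ensdb <- function() {
  if (!requireNamespace("EnsDb.Hsapiens.v79", quietly = TRUE)) {
    return(NULL)
  }
  # Attaching the annotation package also attaches ensembldb.
  suppressPackageStartupMessages(library(EnsDb.Hsapiens.v79))
  # Assumption: GeneIdFilter is the filter class in the installed release.
  genes(EnsDb.Hsapiens.v79, filter = GeneIdFilter("ENSG00000012048"))
}

brca1 <- brca1_from_ensdb()
print(brca1)
```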

cheers, jo

On 03 Jun 2015, at 15:56, Robert M. Flight <rfligh...@gmail.com> wrote:

Ludwig,

If you do this search on the UCSC genome browser (which this annotation
package is built from), you will see that the longest variant is what is
shown:

http://genome.ucsc.edu/cgi-bin/hgTracks?clade=mammal&org=Human&db=hg38&position=brca1&hgt.positionInput=brca1&hgt.suggestTrack=knownGene&Submit=submit&hgsid=429339723_8sd4QD2jSAnAsa6cVCevtoOy4GAz&pix=1885

If instead of "genes" you do "transcripts", you will see 20 different
transcripts for this gene, including the one listed by NCBI.
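
A rough sketch of how one might inspect this from R: list the transcripts of
the gene and compare them with the overall span covered by all of them. It
assumes TxDb.Hsapiens.UCSC.hg38.knownGene is installed; the wrapper name
gene_span_from_transcripts is hypothetical, and a current release of the
package may not reproduce the numbers quoted in this thread.

```r
# Sketch only: returns NULL when the TxDb package is not installed.
gene_span_from_transcripts <- function(entrez_id = "672") {
  pkg <- "TxDb.Hsapiens.UCSC.hg38.knownGene"
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(NULL)
  }
  # Attaching the TxDb package also attaches GenomicFeatures.
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
  txdb <- get(pkg)
  # All transcripts annotated to the gene (a GRanges object)
  tx <- transcriptsBy(txdb, by = "gene")[[entrez_id]]
  # range() gives the overall span covered by these transcripts
  list(transcripts = tx, span = range(tx))
}

res <- gene_span_from_transcripts("672")
if (!is.null(res)) print(res$span)
```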

I haven't tried it yet (I haven't upgraded R or Bioconductor to the latest
version), but there is now an Ensembl-based annotation package as well
that may work better:
http://bioconductor.org/packages/release/data/annotation/html/EnsDb.Hsapiens.v79.html

-Robert



On Wed, Jun 3, 2015 at 7:04 AM Ludwig Geistlinger <
ludwig.geistlin...@bio.ifi.lmu.de> wrote:

Dear Bioc annotation team,

Querying TxDb.Hsapiens.UCSC.hg38.knownGene for gene coordinates, e.g. for

BRCA1; ENSG00000012048; entrez:672

via

genes(TxDb.Hsapiens.UCSC.hg38.knownGene, vals=list(gene_id="672"))

gives me:

GRanges object with 1 range and 1 metadata column:
    seqnames               ranges strand |     gene_id
       <Rle>            <IRanges>  <Rle> | <character>
672    chr17 [43044295, 43170403]      - |         672
-------
seqinfo: 455 sequences (1 circular) from hg38 genome


However, querying Ensembl and NCBI Gene
http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000012048
http://www.ncbi.nlm.nih.gov/gene/672

the gene is located at (note the difference in the end position)

Chromosome 17: 43,044,295-43,125,483 reverse strand
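
To make the size of the discrepancy explicit, the two sets of coordinates
quoted above can be compared with a few lines of base R. The variable names
are chosen here for illustration only; the numbers are the ones reported in
this message.

```r
# Coordinates quoted in this message (chr17, reverse strand)
start_pos   <- 43044295L  # same start in both annotations
end_txdb    <- 43170403L  # end reported by the TxDb (UCSC knownGene) query
end_ensembl <- 43125483L  # end reported by Ensembl / NCBI Gene

# Widths of the two gene ranges (coordinates are 1-based and inclusive)
width_txdb    <- end_txdb - start_pos + 1L
width_ensembl <- end_ensembl - start_pos + 1L

# Number of extra bases covered by the TxDb range
extra_bp <- end_txdb - end_ensembl

cat("TxDb width:", width_txdb, "\n")
cat("Ensembl/NCBI width:", width_ensembl, "\n")
cat("Difference in end position:", extra_bp, "\n")
```

The TxDb range is thus 44,920 bp longer than the Ensembl/NCBI range.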


How is the inconsistency explained, and how can I extract an
ENSEMBL/NCBI-conformant annotation from the TxDb object?
(I am aware of biomaRt, but I want to explicitly use the Bioc annotation
functionality.)

Thanks!
Ludwig


--
Dipl.-Bioinf. Ludwig Geistlinger

Lehr- und Forschungseinheit für Bioinformatik
Institut für Informatik
Ludwig-Maximilians-Universität München
Amalienstrasse 17, 2. Stock, Büro A201
80333 München

Tel.: 089-2180-4067
eMail: ludwig.geistlin...@bio.ifi.lmu.de

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

