Re: [Bioc-devel] Gene annotation: TxDb vs ENSEMBL/NCBI inconsistency

Rainer Johannes Wed, 10 Jun 2015 02:54:30 -0700

update:

seems something is strange with the index... when I load the ensembl-77 DNA for 
mouse I get two files, the fasta file and the index and I can extract the 
sequences, while in the example below I just got the fasta file


On 10 Jun 2015, at 10:11, Rainer Johannes 
<johannes.rai...@eurac.edu<mailto:johannes.rai...@eurac.edu>> wrote:

Dear Martin,

the AnnotationHub approach looks awesome! However, somehow it does not work for 
me, I always get an error:

library(AnnotationHub)
library(Rsamtools)
library(GenomicFeatures)
ah <- AnnotationHub()

## somehow I don't see DNA sequences for release-80... thus using 75
query(ah, c("Takifugu", "release-75"))

## load the gtf and the dna.toplevel.fa
gtf <- ah[["AH10717"]]
dna <- ah[["AH20637"]]

## create txdb and get exons:
txdb <- makeTxDbFromGRanges(gtf)
exons <- exons(txdb)

## getting the sequences
getSeq(dna, exons)

and I get the following error:
Error in value[[3L]](cond) : 'open' index failed
  file: /Users/jo/~/.AnnotationHub/24732
In addition: Warning message:
In doTryCatch(return(expr), name, parentenv, handler) :
  [fai_load] fail to open FASTA index.

## creating the index
indexFa(dna)

## trying again:
getSeq(dna, exons)
Error in value[[3L]](cond) :  record 1 (MT:1-68) was truncated
  file: /Users/jo/~/.AnnotationHub/24732

This one is from the current R-devel, but I get the same error with the stable 
version... any idea?

cheers, jo


my session info:
> sessionInfo()
R Under development (unstable) (2015-06-06 r68485)
Platform: x86_64-apple-darwin14.4.0/x86_64 (64-bit)
Running under: OS X 10.10.4 (Yosemite)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] GenomicFeatures_1.20.1 AnnotationDbi_1.30.1   Biobase_2.28.0
 [4] Rsamtools_1.20.4       Biostrings_2.36.1      XVector_0.8.0
 [7] GenomicRanges_1.20.4   GenomeInfoDb_1.4.0     IRanges_2.2.3
[10] S4Vectors_0.6.0        BiocGenerics_0.14.0    AnnotationHub_2.0.2

loaded via a namespace (and not attached):
 [1] Rcpp_0.11.6                  magrittr_1.5
 [3] GenomicAlignments_1.4.1      zlibbioc_1.14.0
 [5] BiocParallel_1.2.2           xtable_1.7-4
 [7] R6_2.0.1                     stringr_1.0.0
 [9] httr_0.6.1                   tools_3.3.0
[11] DBI_0.3.1                    lambda.r_1.1.7
[13] futile.logger_1.4.1          htmltools_0.2.6
[15] digest_0.6.8                 interactiveDisplayBase_1.6.0
[17] shiny_0.12.0                 rtracklayer_1.28.4
[19] futile.options_1.0.0         bitops_1.0-6
[21] biomaRt_2.24.0               RCurl_1.95-4.6
[23] RSQLite_1.0.0                mime_0.3
[25] stringi_0.4-1                BiocInstaller_1.18.2
[27] XML_3.98-1.2                 httpuv_1.3.2


On 09 Jun 2015, at 16:25, Martin Morgan 
<mtmor...@fredhutch.org<mailto:mtmor...@fredhutch.org>> wrote:

On 06/08/2015 11:43 PM, Rainer Johannes wrote:
dear Robert and Ludwig,

the EnsDb packages provide all the gene/transcript etc annotations for all
genes defined in the Ensembl database (for a given species and Ensembl
release). Except the column/attribute "entrezid" that is stored in the
internal database there is however no link to NCBI or UCSC annotations. So,
basically, if you want to use "pure" Ensembl based annotations: use EnsDb, if
you want to have the UCSC annotations: use the TxDb packages.

In case you need EnsDbs of other species or Ensembl versions, the ensembldb
package provides functionality to generate such packages either using the
Ensembl Perl API or using GTF files provided by Ensembl. If you have problems
building the packages, just drop me a line and I'll do that.

Two other sources of Ensembl TxDb's are GenomicFeatures::makeTxDbFromBiomart() 
and AnnotationHub. For the latter, I'll add a variant of the following to the 
AnnotationHub HOWTO vignette 
http://bioconductor.org/packages/devel/bioc/html/AnnotationHub.html later today.

## Gene models

_Bioconductor_ represents gene models using 'transcript'
databases. These are available via packages such as
[TxDb.Hsapiens.UCSC.hg38.knownGene](http://bioconductor.org/packages/TxDb.Hsapiens.UCSC.knownGene.html),
or can be constructed using functions such as
`[GenomicFeatures](http://bioconductor.org/packages/GenomicFeatures.html)::makeTxDbFromBiomart()`
 or `GenomicFeatures::makeTxDbFromGRanges()`.

_AnnotationHub_ provides an easy way to work with gene models
published by Ensembl. Here we discover the Ensemble release 80 r
esources for pufferfish,_Takifugu rubripes_

```{r takifugu-gene-models}
query(ah, c("Takifugu", "release-80"))
```

We see that there is a GTF file, as well as various DNA
sequences. Let's retrieve the GTF and top-level sequence files. The
GTF file is imported as a _GRanges_ instance, the DNA sequence as a
compressed, indexed Fasta file


```{r takifugi-data}
gtf <- ah[["AH47101"]]
dna <- ah[["AH47477"]]

head(gtf, 3)
dna
head(seqlevels(dna))
```

It is trivial to make a TxDb instance

```{r takifugi-txdb}
library(GenomicFeatures)
txdb <- makeTxDbFromGRanges(gtf)
````

and to use that in conjunction with the DNA sequence, e.g., to find
exon sequences of all annotated genes.

```{r takifugi-exons}
library(Rsamtools)     # for getSeq,FaFile-method
exons <- exons(txdb)
getSeq(dna, exons)
```

Some difficulties arise when working with this partly assembled genome
that require more advanced GenomicRanges skills, see the
[GenomicRanges](http://bioconductor.org/packages/GenomicRanges.html)
vignettes, especially "GenomicRanges HOWTOs" and "An Introduction to
GenomicRanges".



cheers, jo

On 03 Jun 2015, at 15:56, Robert M. Flight 
<rfligh...@gmail.com<mailto:rfligh...@gmail.com>> wrote:

Ludwig,

If you do this search on the UCSC genome browser (which this annotation
package is built from), you will see that the longest variant is what is
shown

http://genome.ucsc.edu/cgi-bin/hgTracks?clade=mammal&org=Human&db=hg38&position=brca1&hgt.positionInput=brca1&hgt.suggestTrack=knownGene&Submit=submit&hgsid=429339723_8sd4QD2jSAnAsa6cVCevtoOy4GAz&pix=1885



If instead of "genes" you do "transcripts", you will see 20 different
transcripts for this gene, including the one listed by NCBI.

I havent tried it yet (haven't upgraded R or bioconductor to latest
version), but there is now an Ensembl based annotation package as well,
that may work better??
http://bioconductor.org/packages/release/data/annotation/html/EnsDb.Hsapiens.v79.html



-Robert



On Wed, Jun 3, 2015 at 7:04 AM Ludwig Geistlinger <
ludwig.geistlin...@bio.ifi.lmu.de<mailto:ludwig.geistlin...@bio.ifi.lmu.de>> 
wrote:

Dear Bioc annotation team,

Querying TxDb.Hsapiens.UCSC.hg38.knownGene for gene coordinates, e.g.
for

BRCA1; ENSG00000012048; entrez:672

via

genes(TxDb.Hsapiens.UCSC.hg38.knownGene, vals=list(gene_id="672"))

gives me:

GRanges object with 1 range and 1 metadata column: seqnames
ranges strand |     gene_id <Rle>            <IRanges>  <Rle> |
<character> 672    chr17 [43044295, 43170403]      - |         672
------- seqinfo: 455 sequences (1 circular) from hg38 genome


However, querying Ensembl and NCBI Gene
http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000012048


http://www.ncbi.nlm.nih.gov/gene/672

the gene is located at (note the difference in the end position)

Chromosome 17: 43,044,295-43,125,483 reverse strand


How is the inconsistency explained and how to extract an ENSEMBL/NCBI
conform annotation from the TxDb object? (I am aware of biomaRt, but I
want to explicitely use the Bioc annotation functionality).

Thanks! Ludwig


-- Dipl.-Bioinf. Ludwig Geistlinger

Lehr- und Forschungseinheit für Bioinformatik Institut für Informatik
Ludwig-Maximilians-Universität München Amalienstrasse 17, 2. Stock, Büro
A201 80333 München

Tel.: 089-2180-4067 eMail: 
ludwig.geistlin...@bio.ifi.lmu.de<mailto:ludwig.geistlin...@bio.ifi.lmu.de>

_______________________________________________ 
Bioc-devel@r-project.org<mailto:Bioc-devel@r-project.org>
mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel


[[alternative HTML version deleted]]

_______________________________________________ 
Bioc-devel@r-project.org<mailto:Bioc-devel@r-project.org>
mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel

_______________________________________________ 
Bioc-devel@r-project.org<mailto:Bioc-devel@r-project.org>
mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel



--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



        [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Re: [Bioc-devel] Gene annotation: TxDb vs ENSEMBL/NCBI inconsistency

Reply via email to