Hi,
On Oct 28, 2009, at 3:44 PM, Martin Morgan wrote:
<snip>
It could be useful, but you'll need to get it into the shape
expected by the
GenomicFeatures functions. With specific regard to transcripts(),
it will
need a 'chrom' column with the chromosome names, 'name' column for
the gene
name, and 'txStart' and 'txEnd' for the start and end of the
transcripts,
using UCSC coordinate conventions.
I'm now thinking that it would be more convenient for the user if
there was
a transcripts method on a RangesList object, which could provide the
necessary information. This could be extracted from the RangedData,
which
rtracklayer can create from your GFF file, with one caveat: if the
GFF file
contains a mixture of features (genes, exons, etc) and relies on the
hierarchical features of GFF, it will take more work to get things
into the
right shape.
The question then is where would this new functionality belong? The
future
of the GenomicFeatures package was a bit uncertain, the last time I
checked.
The intention is that GenomicFeatures will mature to contain these
sorts of data structures and functions, so that it's easy to create or
retrieve transcript or exon information.
</snip>
I've actually been working on a package of my own that does a lot of
this stuff, since I've been working lately with rna-seq data.
It takes an input file similar to what you get from the ucsc table
browser with gene, tx/cds -start/-end, exon, etc. columns and whips up
an annotation package for the genome/annotation source you are using
(info is stored in an SQLite database). You can then query the
database for genes of interest by name, chromosome, or bounded (start/
end) chromosome position for further downstream use.
Usually results are returned to the user as lists of gene objects,
which the user would have to whip into a data.frame format if that's
what's desired.
I'll post some code below to show you what I mean, and see if you all
think it's helpful. I'm not sure that it's ready for public
consumption, but I guess I just wanted to mention that it was my
intention to develop this a bit further in the near future. If it's
useful for other people, all the better.
Unfortunately I need to put much of this on hold for the immediate
future, since I have to prepare for my upcoming qualifying exam ...
I realize it somehow does similar things to what the GenomicFeatures
package might have done, but I needed this exact functionality for my
immediate analysis needs, and thought I'd have a go at it. The SQLite
database that's generated actually stores gene, transcript, and exon
info in RTree indices[1], so ranged queries (like all genes/
transcripts between some start/stop positions) are super fast. It
requires a custom compile of SQLite, though
Examples of how I use it are pasted below.
-steve
[1] RTree: http://www.sqlite.org/rtree.html
##############################
# Using with refseq annotations for hg18
R> library(GenomeAnnotations) # name of my library
R> hg18.refseq <- GenomeDB("hg18", "refseq")
R> d1r <- Gene("DICER1", genome=hg18.refseq)
## Getting transcript info
R> transcripts(d1r)
SimpleRangesList instance:
$NM_030621
IRanges instance:
start end width
[1] 94622320 94626752 4433
[2] 94627120 94627200 81
[3] 94627296 94627456 161
[4] 94629976 94630248 273
[5] 94631912 94632800 889
[6] 94635872 94636024 153
[7] 94639432 94640216 785
[8] 94641160 94641336 177
[9] 94641768 94641872 105
... ... ... ...
[20] 94660288 94660760 473
[21] 94662672 94662840 169
[22] 94665560 94665720 161
[23] 94666144 94666280 137
[24] 94667600 94667728 129
[25] 94668608 94668768 161
[26] 94669408 94669592 185
[27] 94676752 94676848 97
[28] 94677776 94677808 33
<1 additional element>
## UTR Info
R> getUtr3(d1r)
$NM_030621
IRanges instance:
start end width
[1] 94622320 94626587 4268
$NM_177438
IRanges instance:
start end width
[1] 94622320 94626587 4268
R> getUtr5(d1r)
$NM_030621
IRanges instance:
start end width
[1] 94669548 94669592 45
[2] 94676752 94676848 97
[3] 94677776 94677808 33
$NM_177438
IRanges instance:
start end width
[1] 94669548 94669592 45
[2] 94693320 94693512 193
## Get genes on chromosome 1
R> genes.1 <- getGenesOnChromosome(hg18.refseq, 1)
R> names(genes.1)[1:5]
[1] "WASH5P" "LOC100288778" "FAM138F" "FAM138A"
"FAM138C"
## Get genes on chr1 between pos 1e6 and 1e7
R> genes.1.sub <- getGenesOnChromosome(hg18.refseq, 1, start=1e6,
end=1e7)
R> names(genes.1.sub)[1:5]
[1] "C1orf159" "TTLL10" "TTLL10" "TNFRSF18" "TNFRSF18"
#########################################
# or, hg18 aceview annos
R> hg18.ace <- GenomeDB("hg18", "aceview")
R> d1a <- Gene("DICER1", genome=hg18.ace)
R> transcripts(d1a)
SimpleRangesList instance:
$DICER1.cApr07
IRanges instance:
start end width
[1] 94622320 94626752 4433
[2] 94627120 94627200 81
[3] 94627296 94627456 161
[4] 94629976 94630248 273
[5] 94631912 94632800 889
[6] 94635872 94636024 153
[7] 94639432 94640216 785
[8] 94641160 94641336 177
[9] 94641768 94641872 105
... ... ... ...
[19] 94653712 94653840 129
[20] 94660288 94660760 473
[21] 94662672 94662840 169
[22] 94665560 94665720 161
[23] 94666144 94666280 137
[24] 94667600 94667728 129
[25] 94668608 94668768 161
[26] 94669408 94669592 185
[27] 94693320 94693512 193
<10 additional elements>
R> getUtr3(d1a)
$DICER1.cApr07
IRanges instance:
start end width
[1] 94622320 94626587 4268
$DICER1.bApr07
IRanges instance:
start end width
[1] 94622320 94626587 4268
.... etc ...
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing