Re: [Bioc-sig-seq] GenomicFeatures package

Steve Lianoglou Thu, 29 Oct 2009 08:36:57 -0700

Hi,

On Oct 28, 2009, at 3:44 PM, Martin Morgan wrote:
<snip>

It could be useful, but you'll need to get it into the shapeexpected by theGenomicFeatures functions. With specific regard to transcripts(),it willneed a 'chrom' column with the chromosome names, 'name' column forthe genename, and 'txStart' and 'txEnd' for the start and end of thetranscripts,
using UCSC coordinate conventions.
I'm now thinking that it would be more convenient for the user ifthere was
a transcripts method on a RangesList object, which could provide the
necessary information. This could be extracted from the RangedData,whichrtracklayer can create from your GFF file, with one caveat: if theGFF file
contains a mixture of features (genes, exons, etc) and relies on the
hierarchical features of GFF, it will take more work to get thingsinto the
right shape.
The question then is where would this new functionality belong? Thefutureof the GenomicFeatures package was a bit uncertain, the last time Ichecked.
The intention is that GenomicFeatures will mature to contain these
sorts of data structures and functions, so that it's easy to create or
retrieve transcript or exon information.

</snip>

I've actually been working on a package of my own that does a lot ofthis stuff, since I've been working lately with rna-seq data.

It takes an input file similar to what you get from the ucsc tablebrowser with gene, tx/cds -start/-end, exon, etc. columns and whips upan annotation package for the genome/annotation source you are using(info is stored in an SQLite database). You can then query thedatabase for genes of interest by name, chromosome, or bounded (start/end) chromosome position for further downstream use.

Usually results are returned to the user as lists of gene objects,which the user would have to whip into a data.frame format if that'swhat's desired.

I'll post some code below to show you what I mean, and see if you allthink it's helpful. I'm not sure that it's ready for publicconsumption, but I guess I just wanted to mention that it was myintention to develop this a bit further in the near future. If it'suseful for other people, all the better.

Unfortunately I need to put much of this on hold for the immediatefuture, since I have to prepare for my upcoming qualifying exam ...

I realize it somehow does similar things to what the GenomicFeaturespackage might have done, but I needed this exact functionality for myimmediate analysis needs, and thought I'd have a go at it. The SQLitedatabase that's generated actually stores gene, transcript, and exoninfo in RTree indices[1], so ranged queries (like all genes/transcripts between some start/stop positions) are super fast. Itrequires a custom compile of SQLite, though


Examples of how I use it are pasted below.

-steve

[1] RTree: http://www.sqlite.org/rtree.html

##############################
# Using with refseq annotations for hg18

R> library(GenomeAnnotations) # name of my library
R> hg18.refseq <- GenomeDB("hg18", "refseq")
R> d1r <- Gene("DICER1", genome=hg18.refseq)

## Getting transcript info

R> transcripts(d1r)
SimpleRangesList instance:
$NM_030621
IRanges instance:
        start      end width
[1]  94622320 94626752  4433
[2]  94627120 94627200    81
[3]  94627296 94627456   161
[4]  94629976 94630248   273
[5]  94631912 94632800   889
[6]  94635872 94636024   153
[7]  94639432 94640216   785
[8]  94641160 94641336   177
[9]  94641768 94641872   105
...       ...      ...   ...
[20] 94660288 94660760   473
[21] 94662672 94662840   169
[22] 94665560 94665720   161
[23] 94666144 94666280   137
[24] 94667600 94667728   129
[25] 94668608 94668768   161
[26] 94669408 94669592   185
[27] 94676752 94676848    97
[28] 94677776 94677808    33

<1 additional element>

## UTR Info
R> getUtr3(d1r)
$NM_030621
IRanges instance:
       start      end width
[1] 94622320 94626587  4268

$NM_177438
IRanges instance:
       start      end width
[1] 94622320 94626587  4268

R> getUtr5(d1r)
$NM_030621
IRanges instance:
       start      end width
[1] 94669548 94669592    45
[2] 94676752 94676848    97
[3] 94677776 94677808    33

$NM_177438
IRanges instance:
       start      end width
[1] 94669548 94669592    45
[2] 94693320 94693512   193


## Get genes on chromosome 1
R> genes.1 <- getGenesOnChromosome(hg18.refseq, 1)
R> names(genes.1)[1:5]

[1] "WASH5P" "LOC100288778" "FAM138F" "FAM138A""FAM138C"


## Get genes on chr1 between pos 1e6 and 1e7

R> genes.1.sub <- getGenesOnChromosome(hg18.refseq, 1, start=1e6,end=1e7)

R> names(genes.1.sub)[1:5]
[1] "C1orf159" "TTLL10"   "TTLL10"   "TNFRSF18" "TNFRSF18"

#########################################
# or, hg18 aceview annos
R> hg18.ace <- GenomeDB("hg18", "aceview")
R> d1a <- Gene("DICER1", genome=hg18.ace)
R> transcripts(d1a)
SimpleRangesList instance:
$DICER1.cApr07
IRanges instance:
        start      end width
[1]  94622320 94626752  4433
[2]  94627120 94627200    81
[3]  94627296 94627456   161
[4]  94629976 94630248   273
[5]  94631912 94632800   889
[6]  94635872 94636024   153
[7]  94639432 94640216   785
[8]  94641160 94641336   177
[9]  94641768 94641872   105
...       ...      ...   ...
[19] 94653712 94653840   129
[20] 94660288 94660760   473
[21] 94662672 94662840   169
[22] 94665560 94665720   161
[23] 94666144 94666280   137
[24] 94667600 94667728   129
[25] 94668608 94668768   161
[26] 94669408 94669592   185
[27] 94693320 94693512   193

<10 additional elements>

R> getUtr3(d1a)
$DICER1.cApr07
IRanges instance:
       start      end width
[1] 94622320 94626587  4268

$DICER1.bApr07
IRanges instance:
       start      end width
[1] 94622320 94626587  4268

.... etc ...

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
  |  Memorial Sloan-Kettering Cancer Center
  |  Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] GenomicFeatures package

Reply via email to