Hi,

On Oct 28, 2009, at 3:44 PM, Martin Morgan wrote:
<snip>

It could be useful, but you'll need to get it into the shape expected by the GenomicFeatures functions. With specific regard to transcripts(), it will need a 'chrom' column with the chromosome names, 'name' column for the gene name, and 'txStart' and 'txEnd' for the start and end of the transcripts,
using UCSC coordinate conventions.

I'm now thinking that it would be more convenient for the user if there was
a transcripts method on a RangesList object, which could provide the
necessary information. This could be extracted from the RangedData, which rtracklayer can create from your GFF file, with one caveat: if the GFF file
contains a mixture of features (genes, exons, etc) and relies on the
hierarchical features of GFF, it will take more work to get things into the
right shape.

The question then is where would this new functionality belong? The future of the GenomicFeatures package was a bit uncertain, the last time I checked.

The intention is that GenomicFeatures will mature to contain these
sorts of data structures and functions, so that it's easy to create or
retrieve transcript or exon information.
</snip>

I've actually been working on a package of my own that does a lot of this stuff, since I've been working lately with rna-seq data.

It takes an input file similar to what you get from the ucsc table browser with gene, tx/cds -start/-end, exon, etc. columns and whips up an annotation package for the genome/annotation source you are using (info is stored in an SQLite database). You can then query the database for genes of interest by name, chromosome, or bounded (start/ end) chromosome position for further downstream use.

Usually results are returned to the user as lists of gene objects, which the user would have to whip into a data.frame format if that's what's desired.

I'll post some code below to show you what I mean, and see if you all think it's helpful. I'm not sure that it's ready for public consumption, but I guess I just wanted to mention that it was my intention to develop this a bit further in the near future. If it's useful for other people, all the better.

Unfortunately I need to put much of this on hold for the immediate future, since I have to prepare for my upcoming qualifying exam ...

I realize it somehow does similar things to what the GenomicFeatures package might have done, but I needed this exact functionality for my immediate analysis needs, and thought I'd have a go at it. The SQLite database that's generated actually stores gene, transcript, and exon info in RTree indices[1], so ranged queries (like all genes/ transcripts between some start/stop positions) are super fast. It requires a custom compile of SQLite, though

Examples of how I use it are pasted below.

-steve

[1] RTree: http://www.sqlite.org/rtree.html

##############################
# Using with refseq annotations for hg18

R> library(GenomeAnnotations) # name of my library
R> hg18.refseq <- GenomeDB("hg18", "refseq")
R> d1r <- Gene("DICER1", genome=hg18.refseq)

## Getting transcript info

R> transcripts(d1r)
SimpleRangesList instance:
$NM_030621
IRanges instance:
        start      end width
[1]  94622320 94626752  4433
[2]  94627120 94627200    81
[3]  94627296 94627456   161
[4]  94629976 94630248   273
[5]  94631912 94632800   889
[6]  94635872 94636024   153
[7]  94639432 94640216   785
[8]  94641160 94641336   177
[9]  94641768 94641872   105
...       ...      ...   ...
[20] 94660288 94660760   473
[21] 94662672 94662840   169
[22] 94665560 94665720   161
[23] 94666144 94666280   137
[24] 94667600 94667728   129
[25] 94668608 94668768   161
[26] 94669408 94669592   185
[27] 94676752 94676848    97
[28] 94677776 94677808    33

<1 additional element>

## UTR Info
R> getUtr3(d1r)
$NM_030621
IRanges instance:
       start      end width
[1] 94622320 94626587  4268

$NM_177438
IRanges instance:
       start      end width
[1] 94622320 94626587  4268

R> getUtr5(d1r)
$NM_030621
IRanges instance:
       start      end width
[1] 94669548 94669592    45
[2] 94676752 94676848    97
[3] 94677776 94677808    33

$NM_177438
IRanges instance:
       start      end width
[1] 94669548 94669592    45
[2] 94693320 94693512   193


## Get genes on chromosome 1
R> genes.1 <- getGenesOnChromosome(hg18.refseq, 1)
R> names(genes.1)[1:5]
[1] "WASH5P" "LOC100288778" "FAM138F" "FAM138A" "FAM138C"

## Get genes on chr1 between pos 1e6 and 1e7
R> genes.1.sub <- getGenesOnChromosome(hg18.refseq, 1, start=1e6, end=1e7)
R> names(genes.1.sub)[1:5]
[1] "C1orf159" "TTLL10"   "TTLL10"   "TNFRSF18" "TNFRSF18"

#########################################
# or, hg18 aceview annos
R> hg18.ace <- GenomeDB("hg18", "aceview")
R> d1a <- Gene("DICER1", genome=hg18.ace)
R> transcripts(d1a)
SimpleRangesList instance:
$DICER1.cApr07
IRanges instance:
        start      end width
[1]  94622320 94626752  4433
[2]  94627120 94627200    81
[3]  94627296 94627456   161
[4]  94629976 94630248   273
[5]  94631912 94632800   889
[6]  94635872 94636024   153
[7]  94639432 94640216   785
[8]  94641160 94641336   177
[9]  94641768 94641872   105
...       ...      ...   ...
[19] 94653712 94653840   129
[20] 94660288 94660760   473
[21] 94662672 94662840   169
[22] 94665560 94665720   161
[23] 94666144 94666280   137
[24] 94667600 94667728   129
[25] 94668608 94668768   161
[26] 94669408 94669592   185
[27] 94693320 94693512   193

<10 additional elements>

R> getUtr3(d1a)
$DICER1.cApr07
IRanges instance:
       start      end width
[1] 94622320 94626587  4268

$DICER1.bApr07
IRanges instance:
       start      end width
[1] 94622320 94626587  4268

.... etc ...

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
  |  Memorial Sloan-Kettering Cancer Center
  |  Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Reply via email to