Actually, adding making by accept "tx_name" would be cleaner.
exonsBy(txdb, by = c("tx", "tx_name", "gene"))
cdsBy(txdb, by = c("tx", "tx_name", "gene"))
Patrick
On 8/4/10 2:37 PM, Patrick Aboyoun wrote:
That being said, we could still add a flag (useTxNames=FALSE?) to the
exonsBy function that, if TRUE, would check to see if the transcript
names are suitable, and if they are suitable, it would return the
output you are looking for, and if they are not suitable, the function
would throw and error stating the transcript names cannot be used for
this purpose. For Steve use case, using transcript names would work
just fine:
> txdb <- makeTranscriptDbFromUCSC(genome='hg18', tablename='ensGene')
> tx <- transcripts(txdb)
> length(tx)
[1] 63280
> table(nchar(values(tx)[["tx_name"]]))
15
63280
> table(duplicated(values(tx)[["tx_name"]]))
FALSE
63280
Cheers,
Patrick
On 8/4/10 10:39 AM, Marc Carlson wrote:
Hi Steve,
Patrick is correct about the state of transcript IDs. We require them
to be unique, and since we cannot always rely on them being unique and
assigned from some sources, we have to assign them ourselves internally
to be certain. You can still recover the original names using the
transcripts() method (which would be the xscripts object in your
example), and then you would know which of those map based on the
tx_ids.
When we initially mocked this up we were using the transcript IDs
presented by the sources, but we ran into too many circumstances where
we could not rely on them to be unique. The present solution is
slightly less fun to use, but safer for the end user. Internally, we
are using the tx_id as a primary key and so the user can have a better
guarantee that it will behave itself than if we tried to use the other
IDs. A similar thing had to be done for exon ids and cds ids since not
everyone even names those with an ID. Sometimes all your get is a
unique combination of ranges associated with a gene ID.
Marc
On 08/04/2010 08:38 AM, Patrick Aboyoun wrote:
Steve,
Herve has better knowledge of the transcript names key than I do since
he was dealing with the pre-processed data from UCSC and BioMart, but
it is my understanding that there is no guarantee that the transcript
names assigned by these resources are unique or even exist so using
the transcript names can be problematic. We could add a flag
(useTxNames=FALSE?) to the exonsBy function that, if TRUE, would check
to see if the transcript names are suitable, and if they are suitable,
it would return the output you are looking for, and if they are not
suitable, the function would throw and error stating the transcript
names cannot be used for this purpose. Thoughts?
Cheers,
Patrick
On 8/4/10 6:20 AM, Steve Lianoglou wrote:
Howdy,
For what it's worth, I'm trying to hack on GenomicFeatures in a
non-intrusive so that I can (i) wrap some functionality in a way that
I think is useful/intuitive, and (ii) hopefully contribute back to the
package itself.
I addressed the "problem" in my original email by tweaking the
.featuresBy function here:
http://github.com/lianos/GenomicFeaturesX/blob/master/R/transcriptsBy.R#L56
(around lines 56, and again 93)
As I'm a noob with this package, I'm not sure if it will aversely
effect anything else down stream, but it's doing what I need for now,
so that exonsBy(txdb, 'tx') now has the transcript id's in the names()
slot of the GRangesList that is returned.
If it pleases the court, I'd also like to S4-ize the `exons` and
`transcripts` functions (which currently work on TranscriptDb's) since
I'd like to use them as accessors for my Gene-like objects:
http://github.com/lianos/GenomicFeaturesX/blob/master/R/Gene-class.R
Cheers,
-steve
On Tue, Aug 3, 2010 at 7:33 PM, Steve Lianoglou
<[email protected]> wrote:
Hi,
I'm seeing if I can transition some of my code from my
(previously-mentioned) GenomeAnnotations package to use
GenomicFeatures, so I have a few questions of how certain things
should be done w/ GenomicFeatures.
I'll start with this one:
Shouldn't the GRangesList returned by exonsBy(txdb, 'tx') have more
informative names than:
"1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
?
I've made a TranscriptDb from the 'ensembl' gene annos, and I'd like
to reconcile which "transcript exons" belong to which transcripts,
but
it seems there's no direct link from the GRangesList returned by
exonsBy(..., 'tx') and the GRanges object returned by
transcripts(txdb).
Shouldn't there be? One would expect a 1:1 map, no? The length of the
exonsBy/GRangesList is the same as transcripts/GRanges object, so ...
yes :-).
I mean:
R> txdb<- makeTranscriptDbFromUCSC(genome='hg18',
tablename='ensGene')
R> txdb
TranscriptDb object:
| Db type: TranscriptDb
| Data source: UCSC
| Genome: hg18
| UCSC Table: ensGene
| Type of Gene ID: Ensembl gene ID
| Full dataset: yes
| transcript_nrow: 63280
| exon_nrow: 276075
| cds_nrow: 225373
| Db created by: GenomicFeatures package from Bioconductor
| Creation time: 2010-08-03 16:50:06 -0400 (Tue, 03 Aug 2010)
| GenomicFeatures version at creation time: 1.0.6
| RSQLite version at creation time: 0.9-2
R> xcripts<- transcripts(txdb)
R> xcripts
xcripts
GRanges with 63280 ranges and 2 elementMetadata values
seqnames ranges strand | tx_id
tx_name
<Rle> <IRanges> <Rle>
|<integer> <character>
[1] chr1 [ 1873, 3533] + | 1730
ENST00000404059
[2] chr1 [ 20229, 20366] + | 1732
ENST00000408384
...
R> xcripts.exons<- exonsBy(txdb, 'tx')
R> head(names(xcripts.exons))
[1] "1" "2" "3" "4" "5" "6"
I feel like names(xcripts.exons) should give me something like:
"ENST00000404059" "ENST00000408384" "ENST00000359752" .... (in
correct
order, of course)
My same confusion has to do lack of informative names returned from
exonsBy(txdb, 'gene') -- I feel like it should set the names in the
same way that transcriptsBy(txdb, 'gene').
So, I wonder if this is just an oversight, or is it not there on
purpose and I have to rethink the way I approach these problems.
Should I be findOverlap-ing between my xcripts.exons GRangesList and
my xcripts GRanges, or something? And if so, why is that better?
R> sessionInfo()
R version 2.12.0 Under development (unstable) (2010-06-28 r52408)
Platform: x86_64-apple-darwin10.4.0/x86_64 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RSQLite_0.9-2 DBI_0.2-5 GenomicFeatures_1.1.6
GenomicRanges_1.1.20
[5] IRanges_1.7.15
loaded via a namespace (and not attached):
[1] Biobase_2.9.0 biomaRt_2.5.1 Biostrings_2.17.27
BSgenome_1.17.6 RCurl_1.4-3
[6] rtracklayer_1.9.4 tools_2.12.0 XML_3.1-0
Thanks,
-steve
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing