Simon,
I found a faster approach to obtaining the duplicate rows by ranges
within space of RangedData objects. Rather than loop over the RangedData
object x, it is faster to loop over the RangesList ranges(x).
> suppressMessages(library(IRanges))
> load("exons0.rda")
> system.time({
+ dupeRows1 <- unlist(sapply(exons0, function(a)
+ duplicated(ranges(a)[[1]])))
+ exons1 <- exons0[dupeRows1, ]
+ })
user system elapsed
12.848 2.368 15.327
> system.time({
+ dupeRows2 <- unlist(lapply(ranges(exons0), duplicated))
+ exons2 <- exons0[dupeRows2, ]
+ })
user system elapsed
3.768 0.616 4.502
> identical(exons1, exons2)
[1] TRUE
> sessionInfo()
R version 2.10.0 Under development (unstable) (2009-06-28 r48863)
i386-apple-darwin9.7.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] IRanges_1.3.39
Patrick
Patrick Aboyoun wrote:
Simon,
You found a bug. I have fixed it in BioC 2.5 (devel) and it will be
available from bioconductor.org in 24 hours, but you can get it from
svn now.
I think the rdapply function was intended for performing filtering on
RangedData objects, but I haven't used it too much.
> suppressMessages(library(IRanges))
> load("exons0.rda")
> dupeRows <- unlist(sapply(exons0, function(a)
+ duplicated(ranges(a)[[1]])))
> exons1 <- exons0[ dupeRows, ]
> exons1["chr01"]
RangedData: 34312 ranges by 5 columns on 1 sequence
colnames(5): type source phase strand group
names(1): chr01
> sessionInfo()
R version 2.10.0 Under development (unstable) (2009-06-28 r48863)
i386-apple-darwin9.7.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] IRanges_1.3.38
Simon Anders wrote:
Hi Patrick et al.
do you have any suggestions on the following?
I've got an RangesList object 'exons0' that I created from a GFF file
for the human genome. This GFF file contains all transcripts, and
lists all those exons that appear in multiple transcripts multiple
times. I would like to filter them out.
I was pleased to see that you redefined 'duplicated' for RangedData,
which allowed me to find the rows in the IRanges object that are
duplicates. But how do I prune them?
My first try was this here:
dupeRows <- unlist( sapply( exons0, function(a)
duplicated(ranges(a)[[1]]) ) )
exons1 <- exons0[ dupeRows, ]
This seem to do the job:
> exons0
RangedData: 507249 ranges by 5 columns on 25 sequences
colnames(5): type source phase strand group
names(25): chr01 chr02 chr03 chr04 chr05 chr06 ... chr21 chr22 chrMT
chrX chrY
> dupeRows <- unlist( sapply( exons0, function(a)
+ duplicated(ranges(a)[[1]]) ) )
> exons1 <- exons0[ dupeRows, ]
> exons1
RangedData: 253143 ranges by 5 columns on 25 sequences
colnames(5): type source phase strand group
names(25): chr01 chr02 chr03 chr04 chr05 chr06 ... chr21 chr22 chrMT
chrX chrY
However, the resulting object behaves oddly. Compare:
> exons0["chr01"]
RangedData: 61840 ranges by 5 columns on 1 sequence
colnames(5): type source phase strand group
names(1): chr01
> exons1["chr01"]
Error in values[i] : mismatching names (and NULL elements not allowed)
What's going on here?
I've now used this command here instead, which does the job, but
looks quite unwieldy and is very slow:
exons <-
do.call( c, unname( lapply( exons0, function(a)
a[ !duplicated( ranges(a)[[1]] ), ] ) ) )
In case you want to try this yourself, you can find the 'exon0'
object here: http://www.ebi.ac.uk/~anders/tmp/exons0.rda
Cheers
Simon
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing