Simon,
I found a faster approach to obtaining the duplicate rows by ranges within space of RangedData objects. Rather than loop over the RangedData object x, it is faster to loop over the RangesList ranges(x).

> suppressMessages(library(IRanges))
> load("exons0.rda")

> system.time({
+ dupeRows1 <- unlist(sapply(exons0, function(a)
+    duplicated(ranges(a)[[1]])))
+ exons1 <- exons0[dupeRows1, ]
+ })
  user  system elapsed
12.848   2.368  15.327

> system.time({
+ dupeRows2 <- unlist(lapply(ranges(exons0), duplicated))
+ exons2 <- exons0[dupeRows2, ]
+ })
  user  system elapsed
 3.768   0.616   4.502

> identical(exons1, exons2)
[1] TRUE

> sessionInfo()
R version 2.10.0 Under development (unstable) (2009-06-28 r48863)
i386-apple-darwin9.7.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] IRanges_1.3.39



Patrick



Patrick Aboyoun wrote:
Simon,
You found a bug. I have fixed it in BioC 2.5 (devel) and it will be available from bioconductor.org in 24 hours, but you can get it from svn now.

I think the rdapply function was intended for performing filtering on RangedData objects, but I haven't used it too much.


> suppressMessages(library(IRanges))
> load("exons0.rda")
> dupeRows <- unlist(sapply(exons0, function(a)
+   duplicated(ranges(a)[[1]])))
> exons1 <- exons0[ dupeRows, ]
> exons1["chr01"]
RangedData: 34312 ranges by 5 columns on 1 sequence
colnames(5): type source phase strand group
names(1): chr01
> sessionInfo()
R version 2.10.0 Under development (unstable) (2009-06-28 r48863)
i386-apple-darwin9.7.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base other attached packages:
[1] IRanges_1.3.38




Simon Anders wrote:
Hi Patrick et al.

do you have any suggestions on the following?

I've got an RangesList object 'exons0' that I created from a GFF file for the human genome. This GFF file contains all transcripts, and lists all those exons that appear in multiple transcripts multiple times. I would like to filter them out.

I was pleased to see that you redefined 'duplicated' for RangedData, which allowed me to find the rows in the IRanges object that are duplicates. But how do I prune them?

My first try was this here:

  dupeRows <- unlist( sapply( exons0, function(a)
     duplicated(ranges(a)[[1]]) ) )
  exons1 <- exons0[ dupeRows, ]

This seem to do the job:

  > exons0
  RangedData: 507249 ranges by 5 columns on 25 sequences
  colnames(5): type source phase strand group
  names(25): chr01 chr02 chr03 chr04 chr05 chr06 ... chr21 chr22 chrMT
  chrX chrY

  > dupeRows <- unlist( sapply( exons0, function(a)
  +    duplicated(ranges(a)[[1]]) ) )
  > exons1 <- exons0[ dupeRows, ]

  > exons1
  RangedData: 253143 ranges by 5 columns on 25 sequences
  colnames(5): type source phase strand group
  names(25): chr01 chr02 chr03 chr04 chr05 chr06 ... chr21 chr22 chrMT
  chrX chrY

However, the resulting object behaves oddly. Compare:

  > exons0["chr01"]
  RangedData: 61840 ranges by 5 columns on 1 sequence
  colnames(5): type source phase strand group
  names(1): chr01

  > exons1["chr01"]
  Error in values[i] : mismatching names (and NULL elements not allowed)

What's going on here?


I've now used this command here instead, which does the job, but looks quite unwieldy and is very slow:

  exons <-
  do.call( c, unname( lapply( exons0, function(a)
     a[ !duplicated( ranges(a)[[1]] ), ] ) ) )


In case you want to try this yourself, you can find the 'exon0' object here: http://www.ebi.ac.uk/~anders/tmp/exons0.rda


Cheers
  Simon

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Reply via email to