Re: [Bioc-sig-seq] Pruning rows from a RangedData object

Patrick Aboyoun Sat, 18 Jul 2009 09:11:53 -0700

Simon,

I found a faster approach to obtaining the duplicate rows by rangeswithin space of RangedData objects. Rather than loop over the RangedDataobject x, it is faster to loop over the RangesList ranges(x).


> suppressMessages(library(IRanges))
> load("exons0.rda")

> system.time({
+ dupeRows1 <- unlist(sapply(exons0, function(a)
+    duplicated(ranges(a)[[1]])))
+ exons1 <- exons0[dupeRows1, ]
+ })
  user  system elapsed
12.848   2.368  15.327

> system.time({
+ dupeRows2 <- unlist(lapply(ranges(exons0), duplicated))
+ exons2 <- exons0[dupeRows2, ]
+ })
  user  system elapsed
 3.768   0.616   4.502

> identical(exons1, exons2)
[1] TRUE

> sessionInfo()
R version 2.10.0 Under development (unstable) (2009-06-28 r48863)
i386-apple-darwin9.7.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:

[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] IRanges_1.3.39



Patrick



Patrick Aboyoun wrote:

Simon,
You found a bug. I have fixed it in BioC 2.5 (devel) and it will beavailable from bioconductor.org in 24 hours, but you can get it fromsvn now.
I think the rdapply function was intended for performing filtering onRangedData objects, but I haven't used it too much.
> suppressMessages(library(IRanges))
> load("exons0.rda")
> dupeRows <- unlist(sapply(exons0, function(a)
+   duplicated(ranges(a)[[1]])))
> exons1 <- exons0[ dupeRows, ]
> exons1["chr01"]
RangedData: 34312 ranges by 5 columns on 1 sequence
colnames(5): type source phase strand group
names(1): chr01
> sessionInfo()
R version 2.10.0 Under development (unstable) (2009-06-28 r48863)
i386-apple-darwin9.7.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods baseother attached packages:
[1] IRanges_1.3.38




Simon Anders wrote:
Hi Patrick et al.

do you have any suggestions on the following?
I've got an RangesList object 'exons0' that I created from a GFF filefor the human genome. This GFF file contains all transcripts, andlists all those exons that appear in multiple transcripts multipletimes. I would like to filter them out.
I was pleased to see that you redefined 'duplicated' for RangedData,which allowed me to find the rows in the IRanges object that areduplicates. But how do I prune them?
My first try was this here:

  dupeRows <- unlist( sapply( exons0, function(a)
     duplicated(ranges(a)[[1]]) ) )
  exons1 <- exons0[ dupeRows, ]

This seem to do the job:

  > exons0
  RangedData: 507249 ranges by 5 columns on 25 sequences
  colnames(5): type source phase strand group
  names(25): chr01 chr02 chr03 chr04 chr05 chr06 ... chr21 chr22 chrMT
  chrX chrY

  > dupeRows <- unlist( sapply( exons0, function(a)
  +    duplicated(ranges(a)[[1]]) ) )
  > exons1 <- exons0[ dupeRows, ]

  > exons1
  RangedData: 253143 ranges by 5 columns on 25 sequences
  colnames(5): type source phase strand group
  names(25): chr01 chr02 chr03 chr04 chr05 chr06 ... chr21 chr22 chrMT
  chrX chrY

However, the resulting object behaves oddly. Compare:

  > exons0["chr01"]
  RangedData: 61840 ranges by 5 columns on 1 sequence
  colnames(5): type source phase strand group
  names(1): chr01

  > exons1["chr01"]
  Error in values[i] : mismatching names (and NULL elements not allowed)

What's going on here?
I've now used this command here instead, which does the job, butlooks quite unwieldy and is very slow:
  exons <-
  do.call( c, unname( lapply( exons0, function(a)
     a[ !duplicated( ranges(a)[[1]] ), ] ) ) )
In case you want to try this yourself, you can find the 'exon0'object here: http://www.ebi.ac.uk/~anders/tmp/exons0.rda
Cheers
  Simon

_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing


_______________________________________________
Bioc-sig-sequencing mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

Re: [Bioc-sig-seq] Pruning rows from a RangedData object

Reply via email to