[Bioc-devel] changed 'split' behavior

Martin Morgan Tue, 26 Feb 2013 17:13:57 -0800

Bioc-develers,

Wanted to mention a change to R svn revision 62015 arising from


    https://stat.ethz.ch/pipermail/r-devel/2013-January/065700.html

when the first argument, x, is a class. Previously, the behaviour was like

    ind <- split(seq_along(f), f)
    lapply(ind, function(i) x[i])

but the updated behaviour is like

    ind <- split(seq_along(x), f)
    lapply(ind, function(i) x[i])

with 'x' replacing 'f' in seq_along. This is more consistent with thedocumentation and other parts of the default implementation.

We've seen two types of problem. The first is as in genomeIntervals afterrunning example(interval_union) where we have an object 'i' that represents 7intervals


> i
Object of class Genome_intervals_stranded
7 base intervals and 0 inter-base intervals(*):
i.gene.1 chr01 + [1, 2]
i.gene.2 chr01 + (2, 5)
i.gene.3 chr02 - [11, 12)
i.gene.4 chr02 - [8, 9)
i.gene.5 chr02 - [5, 10]
i.gene.6 chr02 + [4, 12]
i.gene.7 chr03 + [2, 6)

annotation:
  seq_name inter_base strand
1    chr01      FALSE      +
2    chr01      FALSE      +
3    chr02      FALSE      -
4    chr02      FALSE      -
5    chr02      FALSE      -
6    chr02      FALSE      +
7    chr03      FALSE      +

and one wishes to split by strand.

> split(i, strand(i))
Error in x@.Data[i, , drop = FALSE] : subscript out of bounds

This fails because seq_along uses length(i) != length(strand(i)) (length(i)reports 14, because i has a 7 x 2 matrix in its .Data slot)

A solution is to define a length method at an appropriate place in the classhierarchy. Such a method would seem to make 'sense' with the vector-likebehaviour of [ sub-setting on i.


  setMethod(length, "Intervals_virtual", function(x) nrow(x))

> split(i, strand(i))
$`+`
Object of class Genome_intervals_stranded
4 base intervals and 0 inter-base intervals(*):
i.gene.1 chr01 + [1, 2]
i.gene.2 chr01 + (2, 5)
i.gene.6 chr02 + [4, 12]
i.gene.7 chr03 + [2, 6)

annotation:
  seq_name inter_base strand
1    chr01      FALSE      +
2    chr01      FALSE      +
6    chr02      FALSE      +
7    chr03      FALSE      +

$`-`
Object of class Genome_intervals_stranded
3 base intervals and 0 inter-base intervals(*):
i.gene.3 chr02 - [11, 12)
i.gene.4 chr02 - [8, 9)
i.gene.5 chr02 - [5, 10]

annotation:
  seq_name inter_base strand
3    chr02      FALSE      -
4    chr02      FALSE      -
5    chr02      FALSE      -

In a second example, fastseg::fastseg can take an ExpressionSet

> fdata = AnnotatedDataFrame(data.frame(f=c("M", "M", "F", "F")))
> e = ExpressionSet(matrix(1:12, 4), featureData=fdata)

and internally does something like

> split(e, fData(e)$f)
... list of ExpressionSets, but not split by fData(e)$f
Warning message:
In split.default(e, fData(e)$f) :
  data length is not a multiple of split variable

Here ExpressionSet is really more matrix-like than vector-like. length(e)reports 1 (maybe it should report nrow * ncol, like matrix), and it doesn't makesense to write a length method on ExpressionSet that returns the number of rows,or a split,ExpressionSet-method. Instead the most parsimonious solution seems tobe to replace a call to split with a one-liner that recapitulates the originalbehaviour


lapply(split(seq_len(nrow(x)), f=f), function(ind) x[ind, , drop = FALSE])

(split has a 'drop' argument that one might wish to pay attention to).

Perhaps there are similar uses of split in your own code?

Martin


--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

[Bioc-devel] changed 'split' behavior

Reply via email to