Dear Aaron

Thank you. I think it's an important simplification of a potential API when you are saying that what you mostly need are accessors
  m[i, ] and m[, i]
with i scalar or a short contiguous range, such that the value of that could be a relatively small ordinary matrix. (Compared to operations like matrix multiplication, SVD or other decompositions.)

        Wolfgang

PS Loops per se in today's R are not as slow as some think: depending on the algorithm, the time "wasted" by the R interpreter on looking up symbols etc may (or may not) be negligible compared to the actual computations that are done at the C level anyway:

g = function(n) {
    s = 0
    for (i in seq_len(n))
        s = s + i
    s
}

cg = compiler::cmpfun(g)

print(system.time( g(1e6)))
   user  system elapsed
  0.161   0.000   0.161

print(system.time(cg(1e6)))
  user  system elapsed
  0.043   0.000   0.043



2.3.17 20:05, Aaron Lun scripsit:
I'll give two examples from the scran package. In both cases, the count
matrix is such that rows are genes and columns are cells. The first
example involves cell cycle phase assignment (from the cyclone()
function, FYI). Briefly, upon entry to C++, the function:

1. Loops through the cells, one at a time.
2. For each cell, it applies a classifier to the counts for that cell
(i.e., a column of the count matrix). This is not a straightforward
operation and also involves a number of random permutations.
3. Returns a set of scores representing the phase assignment.

For a few cells, I could conceivably move the loop into R and just
supply the column counts for each cell via .Call, which would avoid the
need to interact with the matrix in C++. However, if I were to process
one million cells, the slowness of R's loops would really hurt.

The second example involves normalization using a pooling and
deconvolution algorithm (from the computeSumFactors() function). Upon
entry into C++, the function:

1. Loops through an ordered set of cells.
2. At each cell, the neighbouring set of 20-100 cells defines a sliding
window. Counts for all cells in the window are summed together to create
a pooled expression profile.
3. The pooled profile is used to obtain a size factor, by computing the
median of the ratios between the pool and a pseudo-cell.
4. This is repeated for all cells in the set (i.e., all positions of the
window). Each window corresponds to a pool; the function stores the
identity of the cells in the pool and the size factor for the pool.

The output is used to construct a linear system at the R level, which is
solved to obtain cell-specific size factors. Again, the work done within
the loop is not obviously vectorizable with standard functions.

All of the cases I work with involve processing one row or column at a
time; I generally don't do matrix operations that require random access,
at least not at the C++ level.

Another motivation for moving into C++ is the greater control over
memory management. For a decent number of cells, this can make the
difference between something being runnable or not.

Cheers,

Aaron

On 02/03/17 18:09, Wolfgang Huber wrote:
Aaron

Can you describe use cases, i.e. intended computations on these
matrices, esp. those for which C++ access is needed for?

I'm asking b/c the goals of efficient code and abstraction from how the
data are stored may be conflicting - in which case critical algorithms
may end up circumventing a prematurely defined API.

Wolfgang


25.2.17 00:37, Aaron Lun scripsit:
Yes, I think double-precision would be necessary for general use. Only
the raw count data would be integer, and even then that's not
guaranteed (e.g., if people are using kallisto or salmon for
quantification).


-Aaron


________________________________
From: Vincent Carey <st...@channing.harvard.edu>
Sent: Saturday, 25 February 2017 9:25 AM
To: Aaron Lun
Cc: Tim Triche, Jr.; bioc-devel@r-project.org
Subject: Re: [Bioc-devel] any interest in a BiocMatrix core package?

What is the data type for an expression value?  Is it assumed that
double precision will be needed?

On Fri, Feb 24, 2017 at 4:50 PM, Aaron Lun
<a...@wehi.edu.au<mailto:a...@wehi.edu.au>> wrote:
It's a good place to start, though it would be very handy to have a
C(++) API that can be linked against. I'm not sure how much work that
would entail but it would give downstream developers a lot more
options. Sort of like how we can link to Rhtslib, which speeds up a
lot of BAM file processing, instead of just relying on Rsamtools.


-Aaron

________________________________
From: Tim Triche, Jr. <tim.tri...@gmail.com<mailto:tim.tri...@gmail.com>>
Sent: Saturday, 25 February 2017 8:34:58 AM
To: Aaron Lun
Cc: bioc-devel@r-project.org<mailto:bioc-devel@r-project.org>
Subject: Re: [Bioc-devel] any interest in a BiocMatrix core package?

yes

the DelayedArray framework that handles HDF5Array, etc. seems like the
right choice?

--t

On Fri, Feb 24, 2017 at 1:26 PM, Aaron Lun
<a...@wehi.edu.au<mailto:a...@wehi.edu.au><mailto:a...@wehi.edu.au<mailto:a...@wehi.edu.au>>>
wrote:
Hi everyone,

I just attended the Human Cell Atlas meeting in Stanford, and people
were talking about gene expression matrices for >1 million cells. If
we assume that we can get non-zero expression profiles for ~5000
genes, we�d be talking about a 5000 x 1 million matrix for the raw
count data. This would be 20-40 GB in size, which would clearly
benefit from sparse (via Matrix) or disk-backed representations
(bigmatrix, BufferedMatrix, rhdf5, etc.).

I�m wondering whether there is any appetite amongst us for making a
consistent BioC API to handle these matrices, sort of like what
BiocParallel does for multicore and snow. It goes without saying that
the different matrix representations should have consistent functions
at the R level (rbind/cbind, etc.) but it would also be nice to have
an integrated C/C++ API (accessible via LinkedTo). There�s many
non-trivial things that can be done with this type of data, and it is
often faster and more memory efficient to do these complex operations
in compiled code.

I was thinking of something that you could supply any supported matrix
representation to a registered function via .Call; the C++ constructor
would recognise the type of matrix during class instantiation; and
operations (row/column/random read access, also possibly various ways
of writing a matrix) would be overloaded and behave as required for
the class. Only the implementation of the API would need to care about
the nitty gritty of each representation, and we would all be free to
write code that actually does the interesting analytical stuff.

Anyway, just throwing some thoughts out there. Any comments appreciated.

Cheers,


_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

Reply via email to