Re: [Bioc-devel] any interest in a BiocMatrix core package?

Martin Morgan Wed, 01 Nov 2017 14:27:08 -0700

On 11/01/2017 05:15 PM, Bemis, Kylie wrote:

Yes, the ideal solution seems rather unlikely, but I feel like there must be a 
solution better than the current situation.


I’d like to implement some more of the functionality from matrixStats for 
‘matter’ matrices, but importing DelayedArray and DelayedMatrixStats solely for 
the generic functions seems like a bit much. Is that the best thing to do 
though?

It would be very helpful to have this on CRAN, and for matrixStats (and Matrix) to play along.


Martin


Any suggestions?

-Kylie

On Nov 1, 2017, at 4:59 PM, Hervé Pagès <hpa...@fredhutch.org> wrote:

That's probably a good idea but a clean solution would need to
involve all players, including the Matrix package. Right now there
are conflicts for some S4 generics defined in Matrix and in
BiocGenerics (e.g. rowSums). I'm not sure that moving rowSums from
BiocGenerics to a new MatrixGenerics package would address this.
Unless MatrixGenerics is on CRAN and Matrix depends on it ;-)

How likely is this to happen?

H.

On 11/01/2017 01:44 PM, Peter Hickey wrote:

I think that's a good idea, Kylie.
Pete (DelayedMatrixStats developer)

On Thu., 2 Nov. 2017, 6:09 am Kasper Daniel Hansen, <kasperdanielhan...@gmail.com> wrote:

I think it makes sense. A lot of sense. Might be useful to involve Henrik
(matrixStats) as well.

Who are the players, apart from DelayedArray/DelayedMatrixStats and matter?
(and some very old stuff in Biobase which should really be deprecated in
favor of matrixStats).

Best,
Kasper

On Wed, Nov 1, 2017 at 3:03 PM, Bemis, Kylie <k.be...@northeastern.edu> wrote:

Hi all,

To continue a variant of this conversation, with the latest BioC release,
we now have quite a few packages that are implementing various
matrix-related S4 generic functions, many of them relying on matrixStats
as a template.

I was wondering if there is any interest or intention to create a common
MatrixGenerics/ArrayGenerics package on which we can depend to import the
relevant S4 generic functions. Although BiocGenerics has a few like
‘rowSums()’ and ‘colMeans()’, etc., there are many more that are
implemented across ‘DelayedArray’, ‘DelayedMatrixStats’, my own package
‘matter’, etc., including ‘apply()’, ‘rowSds()’, ‘colVars()’, and so
forth.


It would be nice to have a single package with minimal additional
dependencies (a la BiocGenerics) where we could import the various S4
generics and avoid unwanted namespace collisions.

Have there been any thoughts on this?

Many thanks,
Kylie

~~~
Kylie Ariel Bemis
Future Faculty Fellow
College of Computer and Information Science
Northeastern University
kuwisdelu.github.io <https://kuwisdelu.github.io>




On Mar 3, 2017, at 11:27 AM, Kasper Daniel Hansen <kasperdanielhan...@gmail.com> wrote:

On Fri, Mar 3, 2017 at 10:22 AM, Vincent Carey <st...@channing.harvard.edu> wrote:

On Fri, Mar 3, 2017 at 10:07 AM, Kasper Daniel Hansen <kasperdanielhan...@gmail.com> wrote:

Some comments on Aaron's stuff

One possibility for doing things like this is if your code can be done in
C++ using a subset of rows or columns.  That can sometimes give the
necessary speed up.  What I mean is this

Say you can safely process 1000 cells (not matrix cells, but biological
cells, aka columns) at a time in RAM

iterate in R:
   get chunk i containing 1000 cells from the backend data storage
   do something on this sub matrix where everything is in a normal matrix
and you just use C++
   write results out to whatever backend you're using

Then, with a million cells you iterate over 1000 chunks in R.  And you
don't need to "touch" the full dataset which can be stored on an
arbitrary backend.
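
The chunk loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `Chunk`, `ChunkReader`, `column_sums` and `chunked_column_sums` are hypothetical names, the backend is reduced to a callback that returns a block of columns as an ordinary in-memory matrix, and column sums stand in for whatever the real per-chunk kernel would be. The thread suggests driving the loop from R; it is written in C++ here only so that the sketch is self-contained.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <vector>

// Dense, column-major block holding a contiguous run of columns (cells).
struct Chunk {
    std::size_t nrow = 0;
    std::size_t ncol = 0;
    std::vector<double> values;  // length nrow * ncol
};

// Hypothetical backend hook: return columns [first, first + count) as a Chunk.
using ChunkReader = std::function<Chunk(std::size_t first, std::size_t count)>;

// Per-chunk kernel: works on a plain in-memory matrix and knows nothing
// about where the data came from.
std::vector<double> column_sums(const Chunk& chunk) {
    assert(chunk.values.size() == chunk.nrow * chunk.ncol);
    std::vector<double> sums(chunk.ncol, 0.0);
    for (std::size_t j = 0; j < chunk.ncol; ++j) {
        for (std::size_t i = 0; i < chunk.nrow; ++i) {
            sums[j] += chunk.values[j * chunk.nrow + i];
        }
    }
    return sums;
}

// Driver: visit every column, holding at most chunk_size columns at a time.
std::vector<double> chunked_column_sums(const ChunkReader& read,
                                        std::size_t total_cols,
                                        std::size_t chunk_size) {
    if (chunk_size == 0) {
        throw std::invalid_argument("chunk_size must be positive");
    }
    std::vector<double> result;
    result.reserve(total_cols);
    for (std::size_t first = 0; first < total_cols; first += chunk_size) {
        const std::size_t count = std::min(chunk_size, total_cols - first);
        const Chunk chunk = read(first, count);               // load one chunk
        const std::vector<double> sums = column_sums(chunk);  // process in RAM
        result.insert(result.end(), sums.begin(), sums.end());  // store results
    }
    return result;
}
```

Only the reader callback depends on the storage format, so the same kernel could in principle be reused across backends.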

you "touch" it, but you never ingest the whole thing at any time, is that
what you mean?

Yes, you load the chunk into RAM and then just deal with it.

Think of doing 10^10 linear models.  If this was 10^6 I would just use
lmFit.  But 10^10 doesn't fit into memory.  So I load 10^7 into memory,
run lmFit, store results, redo.  This is bound to be much more efficient
than loading a single row into memory and doing lm 10^10 times, because
lmFit is written to do many linear models at the same time.

I am suggesting that this is a potential general strategy.


And this approach could be run even (potentially) with different chunks
on different nodes.

that seems to me to be an important if not essential desideratum.

what then is the role of C++?  extracting a chunk?  preexisting
utilities?


When I say C++ I just mean write an efficient implementation that works
on a chunk, like lmFit.  It is true that anything that works on a chunk
will work on a single row/column (like lmFit) but there are possibilities
for optimization when you work at the chunk level.

Obviously not all computations can be done chunkwise.  But for those that
can, this is a strategy which is independent of the data backend.

I wonder whether this "obviously not" needs to be rethought.  Algorithms
that are implemented to work with data holistically may need
to be reexpressed so that they can succeed with chunkwise access.  Is
this a new mindset needed for holist developers, or can the
effective data decompositions occur autonomously?
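
As one small illustration of what "reexpressing" can look like, a variance is usually written as a pass over all the data, but it can also be built from per-chunk summaries that are merged afterwards. The sketch below uses Welford's update within a chunk and the pairwise merge formula of Chan et al. between chunks; the names `Summary`, `summarize`, `merge` and `sample_variance` are hypothetical and not taken from any package mentioned in this thread.

```cpp
#include <cstddef>
#include <vector>

// Summary of the values seen so far; small enough to pass between chunks or nodes.
struct Summary {
    std::size_t n = 0;
    double mean = 0.0;
    double m2 = 0.0;  // sum of squared deviations from the mean
};

// Summarise one in-memory chunk with Welford's one-pass update.
Summary summarize(const std::vector<double>& x) {
    Summary s;
    for (double v : x) {
        ++s.n;
        const double delta = v - s.mean;
        s.mean += delta / static_cast<double>(s.n);
        s.m2 += delta * (v - s.mean);
    }
    return s;
}

// Combine two summaries without revisiting the underlying data.
Summary merge(const Summary& a, const Summary& b) {
    if (a.n == 0) return b;
    if (b.n == 0) return a;
    Summary out;
    out.n = a.n + b.n;
    const double na = static_cast<double>(a.n);
    const double nb = static_cast<double>(b.n);
    const double n = static_cast<double>(out.n);
    const double delta = b.mean - a.mean;
    out.mean = a.mean + delta * nb / n;
    out.m2 = a.m2 + b.m2 + delta * delta * na * nb / n;
    return out;
}

// Unbiased sample variance; defined as 0 when fewer than two values were seen.
double sample_variance(const Summary& s) {
    return s.n > 1 ? s.m2 / static_cast<double>(s.n - 1) : 0.0;
}
```

Because `merge` only needs the two summaries, the chunks could be processed in any order, or on different nodes, and combined at the end.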

Well, I would say it is obvious that not all computations can be done
chunkwise.  But of course, in the limit of extremely large data,
algorithms which need to cycle over everything no longer scale.  So in
that case all practical computations can be done chunkwise, out of
necessity.  For single cell right now where it is just millions of cells
on the horizon people will think that they can get "standard" holistic
approaches to work (and that is probably true).  If they had a billion
cells they probably wouldn't think about that.

Kasper

If you need direct access to the data in the backend in C++  it will be
extremely backend dependent what is fast and how to do it.  That doesn't
mean we shouldn't do it though.

Best,
Kasper



On Fri, Mar 3, 2017 at 6:47 AM, Vincent Carey <st...@channing.harvard.edu> wrote:

Kylie, thanks for reminding us of matter -- I saw you speak about this at
the first Bioconductor Boston Meetup, but it went like lightning.  For
developers contemplating an approach to representing high-volume
rectangular data, where there is no dominant legacy format, it is natural
to wonder whether HDF5 would be adequate, and, further, to wonder how to
demonstrate that it is or is not dominated by some other approach for a
given set of tasks.  Should we devise a set of bioinformatic benchmark
problems to foster comparison and informed decision-making?
@becker.gabe: is ALTREP far enough along that one could contemplate
benchmarking with it?

On Fri, Feb 24, 2017 at 7:08 PM, Bemis, Kylie <k.be...@northeastern.edu> wrote:

It’s not there yet, but I plan to expose a C++ API for my disk-backed
matrix objects in the next version of my ‘matter’ package.

It’s getting easier to interchange matter/HDF5Array/bigmemory/etc.
objects at the R level, especially if using a frontend like DelayedArray
on top of them, but it would be nice to have a common C++ API that I
could hook into as well (a la Rcpp), so new C/C++ could be re-used across
various backends more easily.

Kylie

~~~
Kylie Ariel Bemis
Future Faculty Fellow
College of Computer and Information Science
Northeastern University
kuwisdelu.github.io <https://kuwisdelu.github.io>





On Feb 24, 2017, at 4:50 PM, Aaron Lun <a...@wehi.edu.au> wrote:

It's a good place to start, though it would be very handy to have a
C(++) API that can be linked against. I'm not sure how much work that
would entail but it would give downstream developers a lot more options.
Sort of like how we can link to Rhtslib, which speeds up a lot of BAM
file processing, instead of just relying on Rsamtools.


-Aaron

________________________________
From: Tim Triche, Jr. <tim.tri...@gmail.com>
Sent: Saturday, 25 February 2017 8:34:58 AM
To: Aaron Lun
Cc: bioc-devel@r-project.org
Subject: Re: [Bioc-devel] any interest in a BiocMatrix core package?

yes

the DelayedArray framework that handles HDF5Array, etc. seems like the
right choice?

--t

On Fri, Feb 24, 2017 at 1:26 PM, Aaron Lun <a...@wehi.edu.au> wrote:

Hi everyone,

I just attended the Human Cell Atlas meeting in Stanford, and people
were talking about gene expression matrices for >1 million cells. If we
assume that we can get non-zero expression profiles for ~5000 genes, we’d
be talking about a 5000 x 1 million matrix for the raw count data. This
would be 20-40 GB in size, which would clearly benefit from sparse (via
Matrix) or disk-backed representations (bigmatrix, BufferedMatrix, rhdf5,
etc.).
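
For reference, the 20-40 GB range matches a dense matrix stored with 4 or 8 bytes per entry (the sizes of R's integer and double types). That reading of the estimate is an assumption, not something stated in the message, and the helper names below are hypothetical:

```cpp
#include <cstdint>

// Bytes needed for a dense genes x cells matrix with a fixed element size.
constexpr std::uint64_t dense_bytes(std::uint64_t genes,
                                    std::uint64_t cells,
                                    std::uint64_t bytes_per_value) {
    return genes * cells * bytes_per_value;
}

// Convert bytes to decimal gigabytes (1 GB = 10^9 bytes).
constexpr double to_gigabytes(std::uint64_t bytes) {
    return static_cast<double>(bytes) / 1e9;
}

// 5000 genes x 1,000,000 cells = 5 * 10^9 entries.
constexpr std::uint64_t as_integers = dense_bytes(5000, 1000000, 4);  // 2 * 10^10 bytes
constexpr std::uint64_t as_doubles  = dense_bytes(5000, 1000000, 8);  // 4 * 10^10 bytes
```

Here `to_gigabytes(as_integers)` is 20 and `to_gigabytes(as_doubles)` is 40.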


I’m wondering whether there is any appetite amongst us for making a
consistent BioC API to handle these matrices, sort of like what
BiocParallel does for multicore and snow. It goes without saying that
the different matrix representations should have consistent functions at
the R level (rbind/cbind, etc.) but it would also be nice to have an
integrated C/C++ API (accessible via LinkingTo). There are many
non-trivial things that can be done with this type of data, and it is
often faster and more memory efficient to do these complex operations in
compiled code.

I was thinking of something where you could supply any supported matrix
representation to a registered function via .Call; the C++ constructor
would recognise the type of matrix during class instantiation; and
operations (row/column/random read access, also possibly various ways
of writing a matrix) would be overloaded and behave as required for the
class. Only the implementation of the API would need to care about the
nitty gritty of each representation, and we would all be free to write
code that actually does the interesting analytical stuff.
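
A very rough sketch of the shape such an API might take is given below. Everything in it is hypothetical: `MatrixView`, `DenseView`, `SparseView` and `total` are made-up names, the R-facing part (recognising the representation from the object passed through .Call) is left out entirely, and only column reads are shown. The sparse class assumes compressed sparse column storage, which is the layout used by Matrix's dgCMatrix class.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Backend-agnostic read interface: analysis code only ever sees this.
class MatrixView {
public:
    virtual ~MatrixView() = default;
    virtual std::size_t nrow() const = 0;
    virtual std::size_t ncol() const = 0;
    // Copy column j into a dense vector of length nrow().
    virtual std::vector<double> column(std::size_t j) const = 0;
};

// Ordinary dense matrix in column-major order.
class DenseView : public MatrixView {
public:
    DenseView(std::size_t nrow, std::size_t ncol, std::vector<double> values)
        : nrow_(nrow), ncol_(ncol), values_(std::move(values)) {
        assert(values_.size() == nrow_ * ncol_);
    }
    std::size_t nrow() const override { return nrow_; }
    std::size_t ncol() const override { return ncol_; }
    std::vector<double> column(std::size_t j) const override {
        const auto first = values_.begin() + static_cast<std::ptrdiff_t>(j * nrow_);
        return std::vector<double>(first, first + static_cast<std::ptrdiff_t>(nrow_));
    }
private:
    std::size_t nrow_, ncol_;
    std::vector<double> values_;
};

// Compressed sparse column storage: col_ptr has ncol + 1 entries.
class SparseView : public MatrixView {
public:
    SparseView(std::size_t nrow, std::vector<std::size_t> col_ptr,
               std::vector<std::size_t> row_idx, std::vector<double> values)
        : nrow_(nrow), col_ptr_(std::move(col_ptr)),
          row_idx_(std::move(row_idx)), values_(std::move(values)) {
        assert(!col_ptr_.empty() && row_idx_.size() == values_.size());
    }
    std::size_t nrow() const override { return nrow_; }
    std::size_t ncol() const override { return col_ptr_.size() - 1; }
    std::vector<double> column(std::size_t j) const override {
        std::vector<double> out(nrow_, 0.0);  // implicit zeros filled in
        for (std::size_t k = col_ptr_[j]; k < col_ptr_[j + 1]; ++k) {
            out[row_idx_[k]] = values_[k];
        }
        return out;
    }
private:
    std::size_t nrow_;
    std::vector<std::size_t> col_ptr_, row_idx_;
    std::vector<double> values_;
};

// "Analytical" code written once against the interface.
double total(const MatrixView& m) {
    double sum = 0.0;
    for (std::size_t j = 0; j < m.ncol(); ++j) {
        for (double v : m.column(j)) sum += v;
    }
    return sum;
}
```

A disk-backed representation would be one more subclass whose `column()` reads from file; `total()` would not need to change.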

Anyway, just throwing some thoughts out there. Any comments appreciated.


Cheers,

Aaron

        [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319
