Github user srowen commented on the pull request:
https://github.com/apache/incubator-spark/pull/575#issuecomment-34954740
My $0.02 to the discussion:
1. Within whatever operations mllib provides, serialization can be
considered an implementation detail. But external serialization will come up,
and I favor supporting something terribly simple. A text-based "row,col,val"
format strikes me as most standard (it is a coordinate list, which is not
quite CSR but almost), since this can be parsed by, say, R or Octave.
2. Agree. Its primary purpose is a hook into BLAS from Java, but I think its
API is "good enough" for the purposes here, in that it supports all the
primitive ops one would want, plus the more complex standard ones like
solving a system.
3. I think one should assume sparse is incompatible with native code, yes.
I think the set of operations that are needed is pretty straightforward and
provided by anything one picks off the shelf.
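
To make the "row,col,val" idea from point 1 concrete, here is a minimal sketch of reading and writing that text format using only the JDK. The class and method names (`TripletText`, `Entry`, `parse`, `format`) are hypothetical and not part of mllib, JBLAS, or Commons Math; the sketch also assumes zero-based indices and no header line, neither of which the discussion has settled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not an existing mllib, JBLAS,
// or Commons Math class.
final class TripletText {

    /** One non-zero entry of a sparse matrix. */
    static final class Entry {
        final int row;
        final int col;
        final double val;

        Entry(int row, int col, double val) {
            this.row = row;
            this.col = col;
            this.val = val;
        }
    }

    /** Parses lines of the form "row,col,val"; blank lines are skipped. */
    static List<Entry> parse(String text) {
        List<Entry> entries = new ArrayList<>();
        for (String line : text.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] parts = trimmed.split(",");
            if (parts.length != 3) {
                throw new IllegalArgumentException("Expected row,col,val but got: " + trimmed);
            }
            entries.add(new Entry(
                Integer.parseInt(parts[0].trim()),
                Integer.parseInt(parts[1].trim()),
                Double.parseDouble(parts[2].trim())));
        }
        return entries;
    }

    /** Writes entries back out, one "row,col,val" line per entry. */
    static String format(List<Entry> entries) {
        StringBuilder sb = new StringBuilder();
        for (Entry e : entries) {
            sb.append(e.row).append(',').append(e.col).append(',').append(e.val).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Round-trip a two-entry matrix through the text format.
        List<Entry> entries = parse("0,0,1.5\n2,1,-4.0\n");
        System.out.print(format(entries));
    }
}
```

The appeal is that nothing here depends on the in-memory representation: whichever library ends up backing the sparse type, it only needs to iterate over its non-zero entries to produce this output.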
On the one hand, it seems crazy to write yet another in-house
implementation. But I think it's a viable and rational alternative. The
argument for it is that the set of operations is quite simple, and it would
really be nice to have an API as closely in line with JBLAS as possible.
A quick way to achieve this is to repurpose the Commons Math class and chop
it up. At the least, there is no need to write from scratch and reintroduce
bugs.
There's an idea in this thread to make a façade to insulate everything
from this choice. This also amounts to writing half of a matrix library, since
you will end up with a lot of engineering to maintain abstractions and
performance.
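
As a rough illustration of why the façade grows quickly, here is a minimal sketch of one with a single operation. Every name in it (`Vec`, `DenseVec`, `SparseVec`, `FacadeDemo`) is hypothetical, and the backing stores are a plain array and a `TreeMap` standing in for whatever JBLAS and Commons Math would actually provide. Even for `dot` alone, the dense implementation already has to know about the sparse one to keep performance reasonable.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical façade for illustration; none of these names come from
// mllib, JBLAS, or Commons Math.
interface Vec {
    int size();
    double get(int i);
    double dot(Vec other);
}

final class DenseVec implements Vec {
    private final double[] data;

    DenseVec(double... data) {
        this.data = data.clone();
    }

    public int size() { return data.length; }

    public double get(int i) { return data[i]; }

    public double dot(Vec other) {
        if (other.size() != size()) throw new IllegalArgumentException("size mismatch");
        // Let a sparse operand drive the loop so only its non-zeros are visited.
        if (other instanceof SparseVec) return other.dot(this);
        double sum = 0.0;
        for (int i = 0; i < data.length; i++) sum += data[i] * other.get(i);
        return sum;
    }
}

final class SparseVec implements Vec {
    private final int size;
    private final Map<Integer, Double> nonZeros = new TreeMap<>();

    SparseVec(int size) { this.size = size; }

    /** Sets one entry; storing zero removes it so only non-zeros are kept. */
    SparseVec set(int i, double v) {
        if (i < 0 || i >= size) throw new IndexOutOfBoundsException("index " + i);
        if (v == 0.0) nonZeros.remove(i); else nonZeros.put(i, v);
        return this;
    }

    public int size() { return size; }

    public double get(int i) {
        Double v = nonZeros.get(i);
        return v == null ? 0.0 : v;
    }

    public double dot(Vec other) {
        if (other.size() != size) throw new IllegalArgumentException("size mismatch");
        double sum = 0.0;
        // Only stored entries contribute to the product.
        for (Map.Entry<Integer, Double> e : nonZeros.entrySet()) {
            sum += e.getValue() * other.get(e.getKey());
        }
        return sum;
    }
}

final class FacadeDemo {
    public static void main(String[] args) {
        Vec dense = new DenseVec(1.0, 2.0, 3.0);
        Vec sparse = new SparseVec(3).set(2, 4.0);
        System.out.println(dense.dot(sparse));
    }
}
```

Every further operation (add, scale, matrix-vector multiply, solve) needs the same dense/sparse case analysis, which is where the "half of a matrix library" cost comes from.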
Here are my personal current top favorite ideas:
1. Use Commons Math everywhere and slip in JBLAS where needed. Consistent
API, no rewriting, and still get the speed where needed.
2. Repurpose the Commons Math sparse implementation to create a new sparse
counterpart to the JBLAS API. Consistent API, a bit of rewriting needed.
3. The façade idea, implemented on top of Commons Math sparse and JBLAS
for now.
... and then long-term I would love to see this question solved really well
by the likes of Breeze or something, with this project then using that.