On Tue, Apr 1, 2014 at 3:09 AM, Ted Dunning <[email protected]> wrote:
> I would rather see a matrix that looks local but acts global so that
> coders can produce very simple code that is still parallelized.
>
And that's exactly how it is done in Bindings.
This discussion is not about that, though. This discussion is about why
doing that on the Matrix and Vector hierarchy is a bad idea.
Let me try to explain why.
The Matrix and Vector APIs have historically mixed in a lot of concerns
(not just linalg operators). E.g. they also include element data access
views and patterns (getQuick, getRow, iterateNonZero), and in-core-specific
optimizer things like
double getLookupCost();
double getIteratorAdvanceCost();
etc. Normally that would be addressed via mix-ins, but it wasn't (and
mix-ins are hard to do in Java in general).
A corollary of that is the simple fact that 95% of Mahout code (and,
more importantly, outside code) looks something like
for (el:v.iterateNonZero()) {
... do something with element
}
*which is not parallelizable at all and would require major
refactoring of the APIs and all user code to make it so.*
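
To make the point concrete, here is a minimal sketch. Everything in it
(SparseVec, sumOfSquares, callerDrivenSumOfSquares) is a hypothetical
stand-in, not Mahout's actual Vector API. It only contrasts a loop that
the caller drives element by element with an operator whose loop lives
inside the implementation:

```java
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; none of these names are Mahout API.
final class IterationSketch {

    // Stand-in for a sparse in-core vector: index -> non-zero value.
    static final class SparseVec implements Iterable<Map.Entry<Integer, Double>> {
        private final TreeMap<Integer, Double> cells = new TreeMap<>();

        SparseVec set(int index, double value) {
            if (value != 0.0) {
                cells.put(index, value);
            } else {
                cells.remove(index);
            }
            return this;
        }

        // Element-level access: the caller pulls one element at a time,
        // in order, on the caller's own thread.
        @Override
        public Iterator<Map.Entry<Integer, Double>> iterator() {
            return cells.entrySet().iterator();
        }

        // Operator-level access: the loop lives inside the implementation,
        // so another backend could split the work without the caller noticing.
        double sumOfSquares() {
            return cells.values().stream().mapToDouble(x -> x * x).sum();
        }
    }

    // The shape of typical user code: a loop body that mutates local state.
    static double callerDrivenSumOfSquares(SparseVec v) {
        double acc = 0.0;
        for (Map.Entry<Integer, Double> el : v) {
            acc += el.getValue() * el.getValue();
        }
        return acc;
    }

    public static void main(String[] args) {
        SparseVec v = new SparseVec().set(0, 3.0).set(7, 4.0);
        System.out.println(callerDrivenSumOfSquares(v));
        System.out.println(v.sumOfSquares());
    }
}
```

In the first form the library never sees the loop body, so it has
nothing it could ship to another machine. In the second form it does.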
*Two arguments follow from that:*
*(1) doing what you say on the AbstractMatrix or AbstractVector
hierarchy is not possible without a "nuclear option" on the API, which
will send a ripple effect inside and outside Mahout (including my own
outside code);*
(2) and even if we invoked the "nuclear option", doing so has no
benefit over introducing a parallel type hierarchy for distributed
matrices, since write-once-run-everywhere works there too.
The idea of write-once, run either in-core or out-of-core is very
noble, but in practice it is neither quite feasible (mostly because of
component lifecycle and optimization checkpointing concerns) nor of
significant value. I.e. if one can have ssvd and dssvd in 29 lines
(assuming the same algorithm even has a parallelization strategy),
then there's no harm in having two separate things for in-core and
out-of-core: ssvd() and dssvd().
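
A rough sketch of what "two separate things" over a parallel type
hierarchy can look like. The names (InCoreVec, ChunkedVec, norm, dnorm)
are made up for illustration, and a list of local partitions processed
by a parallel stream stands in for a real distributed backend. This is
not how Bindings implements it, just the shape of the idea:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.IntStream;

// Hypothetical illustration only; none of these names are Mahout API.
final class ParallelHierarchySketch {

    // In-core type: the data is local, one plain loop does the work.
    static final class InCoreVec {
        final double[] values;

        InCoreVec(double... values) {
            this.values = values.clone();
        }

        double dot(InCoreVec other) {
            double acc = 0.0;
            for (int i = 0; i < values.length; i++) {
                acc += values[i] * other.values[i];
            }
            return acc;
        }
    }

    // Separate "distributed" type: exposes whole-vector operators only,
    // no element access. Partitions stand in for remote blocks.
    static final class ChunkedVec {
        private final List<double[]> partitions;

        ChunkedVec(List<double[]> partitions) {
            this.partitions = partitions;
        }

        // Assumes both operands are partitioned identically.
        double dot(ChunkedVec other) {
            return IntStream.range(0, partitions.size())
                    .parallel()
                    .mapToDouble(p -> new InCoreVec(partitions.get(p))
                            .dot(new InCoreVec(other.partitions.get(p))))
                    .sum();
        }
    }

    // Same algorithm, two short entry points, in the spirit of ssvd()/dssvd().
    static double norm(InCoreVec v) {
        return Math.sqrt(v.dot(v));
    }

    static double dnorm(ChunkedVec v) {
        return Math.sqrt(v.dot(v));
    }

    public static void main(String[] args) {
        InCoreVec local = new InCoreVec(3.0, 0.0, 4.0);
        ChunkedVec chunked = new ChunkedVec(
                Arrays.asList(new double[] {3.0}, new double[] {0.0, 4.0}));
        System.out.println(norm(local));
        System.out.println(dnorm(chunked));
    }
}
```

The algorithm body reads the same in both entry points; only the type
it is written against differs, and the existing in-core API is left
untouched.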