[
https://issues.apache.org/jira/browse/MAHOUT-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957216#comment-13957216
]
Dmitriy Lyubimov commented on MAHOUT-1500:
------------------------------------------
@Anand, bottom line: the core of AbstractMatrix and Vector is element-wise
iterators and direct element accessors. Those contracts have no distributed
programming model behind them, so they don't work for the distributed stuff.
There are two ways to go with such an approach. One is to declare the core
abstractions unsupported in the distributed implementation, which just proves
that AbstractMatrix and Vector are not good abstractions for that work (why
would one need an abstraction if its major, core contracts are all of a sudden
declared optional or deprecated?).
Truth be told, there is some Matrix api that uses FP -- two major things are
aggregate() and assign(). However, this still doesn't get us anywhere, in the
sense that we should support _all_ core contracts, not just assign() and
aggregate().
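To make the distinction concrete, here is a minimal sketch, not Mahout code.
ChunkedVector is a hypothetical toy type whose chunks stand in for partitions
held by different workers, and it uses java.util.function types rather than
Mahout's own function classes. The assign()/aggregate() style pushes a function
to the data, while the direct accessor has to find the owning chunk first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.DoubleBinaryOperator;
import java.util.function.DoubleUnaryOperator;

// Hypothetical toy type, not part of mahout-math: a vector split into chunks,
// each chunk standing in for a partition held by a different worker.
final class ChunkedVector {
    private final List<double[]> chunks = new ArrayList<>();

    ChunkedVector(double[]... parts) {
        for (double[] p : parts) {
            chunks.add(p.clone());
        }
    }

    // Functional contract: the function travels to each chunk, so every
    // chunk can be updated independently of the others.
    ChunkedVector assign(DoubleUnaryOperator f) {
        for (double[] chunk : chunks) {
            for (int i = 0; i < chunk.length; i++) {
                chunk[i] = f.applyAsDouble(chunk[i]);
            }
        }
        return this;
    }

    // Functional contract: map every element, fold within each chunk, then
    // fold the per-chunk partial results. Assumes an associative combiner
    // and at least one non-empty chunk.
    double aggregate(DoubleBinaryOperator combiner, DoubleUnaryOperator mapper) {
        boolean seen = false;
        double total = 0.0;
        for (double[] chunk : chunks) {
            if (chunk.length == 0) {
                continue;
            }
            double partial = mapper.applyAsDouble(chunk[0]);
            for (int i = 1; i < chunk.length; i++) {
                partial = combiner.applyAsDouble(partial, mapper.applyAsDouble(chunk[i]));
            }
            total = seen ? combiner.applyAsDouble(total, partial) : partial;
            seen = true;
        }
        if (!seen) {
            throw new IllegalStateException("empty vector");
        }
        return total;
    }

    // Direct accessor: must locate the owning chunk before reading. In this
    // toy that is a loop; across machines it would be a lookup per element.
    double get(int index) {
        if (index < 0) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int offset = index;
        for (double[] chunk : chunks) {
            if (offset < chunk.length) {
                return chunk[offset];
            }
            offset -= chunk.length;
        }
        throw new IndexOutOfBoundsException("index " + index);
    }
}
```

In this sketch assign() and aggregate() touch each chunk exactly once, whereas
a caller looping over get(i) pays the chunk lookup for every single element.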
Another way of going about it is to heavily refactor the core abstraction in
favor of functional support, while deprecating or eliminating direct access. I
call this the "nuclear option", because it sends ripple effects not only thru
Mahout, but thru any 3rd party code that uses mahout-math (mine specifically).
It will force people to reconsider using Mahout because of stability issues in
the areas where it was promised to be stable.
Extending the DistributedRowMatrix api... I am kind of dubious about it as
well, since it is also unusable without a major FP infusion, and frankly kind
of ancient. More likely, a completely new FP-laced distributed Matrix
representation is desired. SparkBindings went that path and created an FP-laced
DRM api. But this is entirely a Scala-side abstraction, with Scala function
literals etc. So if you are looking to create a Java distributed matrix
abstraction, this is not going to be useful at all either.
So more likely, you need a completely new FP-oriented Java API interface.
Something like X2OMatrix.java. This will fragment the project even further, but
all marketing fluff aside, that's the only realistic option I see that might
work.
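As a rough illustration of what such an interface might look like (purely a
sketch: FunctionalMatrix, LocalFunctionalMatrix and their methods are
hypothetical names, not an existing Mahout or H2O API), the contract below
exposes only operations that take functions, and deliberately has no direct
element accessors or iterators.

```java
import java.util.function.DoubleBinaryOperator;
import java.util.function.ToDoubleFunction;
import java.util.function.UnaryOperator;

// Hypothetical interface: names and methods are illustrative only.
// Note what is missing: no get(row, col), no set(row, col), no iterators.
interface FunctionalMatrix {
    int numRows();

    // Apply rowFn to every row and return the result as a new matrix.
    FunctionalMatrix mapRows(UnaryOperator<double[]> rowFn);

    // Reduce each row to a number, then fold those numbers with combiner.
    // Assumes an associative combiner and at least one row.
    double aggregateRows(ToDoubleFunction<double[]> mapper, DoubleBinaryOperator combiner);
}

// Single-JVM stand-in used only to exercise the interface; a real backend
// would ship rowFn and mapper to wherever the rows live.
final class LocalFunctionalMatrix implements FunctionalMatrix {
    private final double[][] rows;

    LocalFunctionalMatrix(double[][] rows) {
        this.rows = new double[rows.length][];
        for (int i = 0; i < rows.length; i++) {
            this.rows[i] = rows[i].clone();
        }
    }

    @Override
    public int numRows() {
        return rows.length;
    }

    @Override
    public FunctionalMatrix mapRows(UnaryOperator<double[]> rowFn) {
        double[][] out = new double[rows.length][];
        for (int i = 0; i < rows.length; i++) {
            // Hand the function a copy so the source matrix stays unchanged.
            out[i] = rowFn.apply(rows[i].clone());
        }
        return new LocalFunctionalMatrix(out);
    }

    @Override
    public double aggregateRows(ToDoubleFunction<double[]> mapper, DoubleBinaryOperator combiner) {
        if (rows.length == 0) {
            throw new IllegalStateException("no rows");
        }
        double acc = mapper.applyAsDouble(rows[0]);
        for (int i = 1; i < rows.length; i++) {
            acc = combiner.applyAsDouble(acc, mapper.applyAsDouble(rows[i]));
        }
        return acc;
    }
}
```

Because callers never index into the matrix, an implementation is free to keep
the rows wherever it likes, which is the property the existing accessor-based
contracts cannot offer.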
I would also question (kinda) the wisdom of a standalone distributed vector
abstraction. On the Hadoop side and the Spark side this abstraction is
completely bypassed (it is assumed that a real vector will always fit into
single-machine memory). In situations where a vector might be formed as a
result of a distributed operation (e.g. A %*% x), the result is simply a
distributed single-column matrix, from which the column can always be collected
in the front end via the collection/slicing api.
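A small sketch of that last point, under stated assumptions:
RowPartitionedMatrix, times() and collectColumn() are hypothetical names for
illustration, not the Hadoop or Spark side api, and the partitions here are
plain in-memory blocks. The product A %*% x comes back as another partitioned
matrix with one column, and only the front end turns it into a local array.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy type: a matrix stored as row blocks, each block standing
// in for a partition held by a different worker.
final class RowPartitionedMatrix {
    private final List<double[][]> partitions;

    RowPartitionedMatrix(List<double[][]> partitions) {
        this.partitions = partitions;
    }

    // A %*% x, with x assumed small enough to be available to every
    // partition. The result is again partitioned: an n x 1 matrix.
    RowPartitionedMatrix times(double[] x) {
        List<double[][]> out = new ArrayList<>();
        for (double[][] block : partitions) {
            double[][] column = new double[block.length][1];
            for (int r = 0; r < block.length; r++) {
                double dot = 0.0;
                for (int c = 0; c < x.length; c++) {
                    dot += block[r][c] * x[c];
                }
                column[r][0] = dot;
            }
            out.add(column);
        }
        return new RowPartitionedMatrix(out);
    }

    // Front-end collection: gather one column from all partitions into a
    // local array, in partition order.
    double[] collectColumn(int col) {
        int n = 0;
        for (double[][] block : partitions) {
            n += block.length;
        }
        double[] result = new double[n];
        int i = 0;
        for (double[][] block : partitions) {
            for (double[] row : block) {
                result[i++] = row[col];
            }
        }
        return result;
    }
}
```

With this shape, no separate distributed vector type is needed: the only
vectors the caller ever holds are ordinary in-memory arrays.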
> H2O integration
> ---------------
>
> Key: MAHOUT-1500
> URL: https://issues.apache.org/jira/browse/MAHOUT-1500
> Project: Mahout
> Issue Type: Improvement
> Reporter: Anand Avati
> Fix For: 1.0
>
>
> Integration with h2o (github.com/0xdata/h2o) in order to exploit its high
> performance computational abilities.
> Start with providing implementations of AbstractMatrix and AbstractVector,
> and more as we make progress.