[
https://issues.apache.org/jira/browse/MAHOUT-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956824#comment-13956824
]
Dmitriy Lyubimov edited comment on MAHOUT-1500 at 4/1/14 6:01 PM:
------------------------------------------------------------------
bq. Now it seems to me (with my limited exploring of Mahout) that it might
actually be viable to provide a "hadoop alternative" in the form of an
alternate implementation of DistributedRowMatrix (instead of AbstractMatrix)
Yes, that's what I meant. On the Scala side, this is done by introducing
mix-ins such as DrmLike, RLikeOps, RLikeDrmOps, RLikeVectorOps, etc. On the
Java side, working with mix-ins (functionality-filled traits) is of course not
easy, but the important point is that it should be an alternative hierarchy
that shares an identical set of optimized linalg operators (operator-oriented
semantics in linear algebra).
I.e., the assumption is that to the end user (developer) it is more important
that the notation
{code}
a dot b
{code}
means exactly the same regardless of whether a and b are in-core or
distributed; it matters significantly less whether a and b descend from
different hierarchies (e.g. Matrix or DRM), as long as the operator dot(A,B) is
defined for all possible type combinations (sparse, dense, distributed).
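To make the point concrete, here is a minimal Java sketch, not actual Mahout code. The types InCoreVec, ChunkedVec and Ops are hypothetical names made up for illustration, and ChunkedVec only simulates a distributed operand by splitting values into local chunks. The operands come from unrelated hierarchies, yet dot is defined for every combination and always means the same thing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical types for illustration only; none of these are Mahout classes.

/** In-core operand: a dense array held in local memory. */
final class InCoreVec {
    final double[] values;

    InCoreVec(double... values) {
        this.values = values.clone();
    }
}

/** Stand-in for a distributed operand: chunks that could live on different nodes. */
final class ChunkedVec {
    final List<double[]> chunks = new ArrayList<>();
    final int size;

    ChunkedVec(int chunkSize, double... values) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        for (int start = 0; start < values.length; start += chunkSize) {
            int end = Math.min(values.length, start + chunkSize);
            chunks.add(Arrays.copyOfRange(values, start, end));
        }
        this.size = values.length;
    }

    /** Gathers all chunks into a single in-core vector. */
    InCoreVec collect() {
        double[] all = new double[size];
        int offset = 0;
        for (double[] chunk : chunks) {
            System.arraycopy(chunk, 0, all, offset, chunk.length);
            offset += chunk.length;
        }
        return new InCoreVec(all);
    }
}

/** One operator, overloaded for every combination of operand types. */
final class Ops {
    private Ops() {}

    static double dot(InCoreVec a, InCoreVec b) {
        if (a.values.length != b.values.length) {
            throw new IllegalArgumentException("size mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < a.values.length; i++) {
            sum += a.values[i] * b.values[i];
        }
        return sum;
    }

    static double dot(ChunkedVec a, InCoreVec b) {
        if (a.size != b.values.length) {
            throw new IllegalArgumentException("size mismatch");
        }
        double sum = 0.0;
        int offset = 0;
        // Each chunk contributes a partial sum against the matching slice of b.
        for (double[] chunk : a.chunks) {
            for (int i = 0; i < chunk.length; i++) {
                sum += chunk[i] * b.values[offset + i];
            }
            offset += chunk.length;
        }
        return sum;
    }

    static double dot(InCoreVec a, ChunkedVec b) {
        return dot(b, a); // dot is symmetric
    }

    static double dot(ChunkedVec a, ChunkedVec b) {
        // Simplest possible strategy for a sketch: gather one side in-core.
        return dot(a, b.collect());
    }
}

final class DotDemo {
    public static void main(String[] args) {
        InCoreVec a = new InCoreVec(1.0, 2.0, 3.0);
        ChunkedVec b = new ChunkedVec(2, 4.0, 5.0, 6.0);
        System.out.println(Ops.dot(a, b)); // prints 32.0
    }
}
```

The caller writes Ops.dot(a, b) without caring which kind of operand it holds; only the overload chosen by the compiler differs.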
bq. and AbstractJob (by internally using h2o's Frame/Vec and MRTask2 APIs), and
thereby allow for a runtime choice of Hadoop vs H2O.
I care significantly less about the Job API and Hadoop MR in particular. It is
my belief that they are non-essential to the math user and should therefore be
avoided altogether (and that notion is eliminated in the Spark Bindings).
bq. This seems like a reasonable first step?
Yes -- with the caveat that logical mix-ins for distributed and in-core already
exist in Scala and the Spark Bindings. Like I said, ideally mapping this
logical layer onto a particular physical layer seems to be an infinitely better
architecture to me than creating yet another logical layer specific to a
particular backend. However, I see that it would be hard to converge on that,
or at least I don't see how. I will extract an architecture slide from my talk
and post a link to illustrate the idea a bit later.
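As a rough illustration of mapping one logical layer onto interchangeable physical layers, here is a minimal Java sketch. All names (LogicalOp, Leaf, Plus, Scale, Backend, Planner and the two backends) are hypothetical; they are not Mahout, Spark Bindings, or H2O APIs, and the parallel-stream backend merely stands in for a real distributed engine.

```java
import java.util.stream.IntStream;

// Hypothetical names throughout; not Mahout, Spark Bindings, or H2O APIs.

/** Logical layer: describes what to compute, independent of any engine. */
interface LogicalOp {}

final class Leaf implements LogicalOp {
    final double[] data;

    Leaf(double... data) {
        this.data = data.clone();
    }
}

final class Plus implements LogicalOp {
    final LogicalOp left;
    final LogicalOp right;

    Plus(LogicalOp left, LogicalOp right) {
        this.left = left;
        this.right = right;
    }
}

final class Scale implements LogicalOp {
    final LogicalOp arg;
    final double factor;

    Scale(LogicalOp arg, double factor) {
        this.arg = arg;
        this.factor = factor;
    }
}

/** Physical layer: the kernels each engine has to supply. */
interface Backend {
    double[] plus(double[] a, double[] b);

    double[] scale(double[] a, double factor);
}

/** Plain single-threaded loops. */
final class SequentialBackend implements Backend {
    @Override
    public double[] plus(double[] a, double[] b) {
        double[] out = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i] + b[i];
        }
        return out;
    }

    @Override
    public double[] scale(double[] a, double factor) {
        double[] out = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i] * factor;
        }
        return out;
    }
}

/** Parallel streams, used here only as a stand-in for a distributed engine. */
final class ParallelBackend implements Backend {
    @Override
    public double[] plus(double[] a, double[] b) {
        return IntStream.range(0, a.length).parallel()
                .mapToDouble(i -> a[i] + b[i]).toArray();
    }

    @Override
    public double[] scale(double[] a, double factor) {
        return IntStream.range(0, a.length).parallel()
                .mapToDouble(i -> a[i] * factor).toArray();
    }
}

/** Walks the logical plan once and delegates every operator to the chosen backend. */
final class Planner {
    private Planner() {}

    static double[] run(LogicalOp plan, Backend backend) {
        if (plan instanceof Leaf) {
            return ((Leaf) plan).data;
        }
        if (plan instanceof Plus) {
            Plus p = (Plus) plan;
            double[] l = run(p.left, backend);
            double[] r = run(p.right, backend);
            if (l.length != r.length) {
                throw new IllegalArgumentException("size mismatch");
            }
            return backend.plus(l, r);
        }
        if (plan instanceof Scale) {
            Scale s = (Scale) plan;
            return backend.scale(run(s.arg, backend), s.factor);
        }
        throw new IllegalArgumentException("unknown logical operator: " + plan);
    }
}
```

The same plan, e.g. new Scale(new Plus(new Leaf(1.0, 2.0), new Leaf(3.0, 4.0)), 2.0), can be handed to Planner.run with either backend and should give the same result; supporting another engine in this sketch means adding a Backend implementation, not another logical layer.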
> H2O integration
> ---------------
>
> Key: MAHOUT-1500
> URL: https://issues.apache.org/jira/browse/MAHOUT-1500
> Project: Mahout
> Issue Type: Improvement
> Reporter: Anand Avati
> Fix For: 1.0
>
>
> Integration with h2o (github.com/0xdata/h2o) in order to exploit its high
> performance computational abilities.
> Start with providing implementations of AbstractMatrix and AbstractVector,
> and more as we make progress.
--
This message was sent by Atlassian JIRA
(v6.2#6252)