[jira] [Comment Edited] (MAHOUT-1500) H2O integration

Dmitriy Lyubimov (JIRA) Tue, 01 Apr 2014 11:02:17 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956824#comment-13956824
 ]


Dmitriy Lyubimov edited comment on MAHOUT-1500 at 4/1/14 6:01 PM:
------------------------------------------------------------------

bq. Now it seems to me (with my limited exploring of Mahout) that it might 
actually be viable to provide a "hadoop alternative" in the form of an 
alternate implementation of DistributedRowMatrix (instead of AbstractMatrix) 

Yes, that's what I meant. On the Scala side, this is done by introducing mix-ins 
(DrmLike, RLikeOps, RLikeDrmOps, RLikeVectorOps, etc.). On the Java side, working 
with mix-ins (functionality-filled traits) is of course not easy, but the 
important point is that it should be an alternative hierarchy that shares an 
identical set of optimized linalg operators (operator-oriented semantics in 
linear algebra). 

I.e., the assumption is that to the end user (developer) it is more important 
that the notation
{code}
a dot b
{code} 

means exactly the same regardless of whether a and b are in-core or 
distributed; it matters significantly less whether a and b descend from 
different hierarchies (e.g. Matrix or DRM), as long as the operator dot(A,B) is 
defined for all possible type combinations (sparse, dense, distributed).
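
To make the point concrete, here is a minimal Scala sketch of one way to get 
the same "a dot b" notation across unrelated hierarchies. All names in it 
(InCoreVec, DistVec, Dot, DotSyntax) are hypothetical and are not the Mahout or 
Spark Bindings API; the "distributed" type is just a partitioned in-memory 
stand-in.

```scala
// Hypothetical types for illustration only; these are not Mahout classes.
final case class InCoreVec(values: Vector[Double])
// Stand-in for a distributed vector: the data is split into partitions.
final case class DistVec(partitions: Seq[Vector[Double]])

// Type class: one instance per (left, right) operand combination.
trait Dot[A, B] { def apply(a: A, b: B): Double }

object Dot {
  private def dotSeq(x: Seq[Double], y: Seq[Double]): Double = {
    require(x.length == y.length, "operands must have the same length")
    x.zip(y).map { case (p, q) => p * q }.sum
  }

  // Builds an instance from accessors that expose each operand's flat data.
  // A real backend would compute partial products per partition instead of
  // flattening everything locally; this is a simplification.
  private def instance[A, B](fa: A => Seq[Double], fb: B => Seq[Double]): Dot[A, B] =
    new Dot[A, B] { def apply(a: A, b: B): Double = dotSeq(fa(a), fb(b)) }

  implicit val inCoreInCore: Dot[InCoreVec, InCoreVec] =
    instance[InCoreVec, InCoreVec](_.values, _.values)
  implicit val inCoreDist: Dot[InCoreVec, DistVec] =
    instance[InCoreVec, DistVec](_.values, _.partitions.flatten)
  implicit val distInCore: Dot[DistVec, InCoreVec] =
    instance[DistVec, InCoreVec](_.partitions.flatten, _.values)
  implicit val distDist: Dot[DistVec, DistVec] =
    instance[DistVec, DistVec](_.partitions.flatten, _.partitions.flatten)
}

object DotSyntax {
  // Adds the infix `dot` operator to any type that has a Dot instance.
  implicit class DotOps[A](a: A) {
    def dot[B](b: B)(implicit ev: Dot[A, B]): Double = ev(a, b)
  }
}

object DotDemo {
  import DotSyntax._

  val a = InCoreVec(Vector(1.0, 2.0, 3.0))
  val b = DistVec(Seq(Vector(4.0), Vector(5.0, 6.0)))

  // The notation is identical for every operand combination.
  val inCoreOnly: Double = a dot a
  val mixed: Double = a dot b
  val mixedReversed: Double = b dot a
  val distOnly: Double = b dot b
}
```

InCoreVec and DistVec share no common supertype here, yet the caller writes 
the same expression in every case; a combination without an instance simply 
fails to compile.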

bq. and AbstractJob (by internally using h2o's Frame/Vec and MRTask2 APIs), and 
thereby allow for a runtime choice of Hadoop vs H2O. 

I care significantly less about the Job API and Hadoop MR in particular. It is 
my belief that they are non-essential to the math user and should therefore be 
avoided altogether (and that notion is eliminated in the Spark Bindings).

bq. This seems like a reasonable first step?
Yes, with the caveat that logical mix-ins for both the distributed and in-core 
cases already exist in Scala and the Spark Bindings. Like I said, ideally, 
mapping this logical layer onto a particular physical layer seems to me an 
infinitely better architecture than creating yet another logical layer specific 
to a particular backend. However, I see that it would be hard to converge on 
that, or at least I don't see how. I will extract an architecture slide from my 
talk and post a link to illustrate the idea a bit later.
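
In the meantime, a rough Scala sketch of what "one logical layer, several 
physical layers" could look like. Everything here (Expr, Backend, 
SequentialBackend, ChunkedBackend) is a hypothetical toy, not the actual Spark 
Bindings design and not an H2O API; the chunked backend only mimics 
partitioned execution in memory.

```scala
// Logical layer: a backend-agnostic description of the computation.
sealed trait Expr
final case class Lit(values: Vector[Double]) extends Expr
final case class Plus(left: Expr, right: Expr) extends Expr
final case class Scale(expr: Expr, factor: Double) extends Expr

// Physical layer: each backend decides how to execute a logical plan.
trait Backend { def eval(e: Expr): Vector[Double] }

// Evaluates the plan element by element in one pass.
object SequentialBackend extends Backend {
  def eval(e: Expr): Vector[Double] = e match {
    case Lit(v)      => v
    case Plus(l, r)  => eval(l).zip(eval(r)).map { case (x, y) => x + y }
    case Scale(x, k) => eval(x).map(_ * k)
  }
}

// Evaluates the same plan chunk by chunk, as a partitioned engine might.
final class ChunkedBackend(chunkSize: Int) extends Backend {
  require(chunkSize > 0, "chunkSize must be positive")

  def eval(e: Expr): Vector[Double] = e match {
    case Lit(v) => v
    case Plus(l, r) =>
      val (a, b) = (eval(l), eval(r))
      a.grouped(chunkSize).zip(b.grouped(chunkSize))
        .flatMap { case (ca, cb) => ca.zip(cb).map { case (x, y) => x + y } }
        .toVector
    case Scale(x, k) =>
      eval(x).grouped(chunkSize).flatMap(_.map(_ * k)).toVector
  }
}

object PlanDemo {
  // The user writes the plan once, without naming a backend.
  val plan: Expr =
    Scale(Plus(Lit(Vector(1.0, 2.0, 3.0)), Lit(Vector(10.0, 20.0, 30.0))), 2.0)

  val sequential: Vector[Double] = SequentialBackend.eval(plan)
  val chunked: Vector[Double] = new ChunkedBackend(2).eval(plan)
}
```

Adding another engine under this assumption means writing one more Backend, 
while the plan and the user-facing operators stay untouched.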



> H2O integration
> ---------------
>
>                 Key: MAHOUT-1500
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1500
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Anand Avati
>             Fix For: 1.0
>
>
> Integration with h2o (github.com/0xdata/h2o) in order to exploit its high 
> performance computational abilities.
> Start with providing implementations of AbstractMatrix and AbstractVector, 
> and more as we make progress.



