[jira] [Commented] (MAHOUT-1500) H2O integration

Dmitriy Lyubimov (JIRA) Tue, 01 Apr 2014 01:34:23 -0700

    [ https://issues.apache.org/jira/browse/MAHOUT-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956246#comment-13956246 ]


Dmitriy Lyubimov commented on MAHOUT-1500:
------------------------------------------

bq. out will be "depending" on h2o (like how it "depends on" Spark), and there 
will be enough infrastructure implementations (like Matrix, Vector, possibly 
Job) which would allow for existing algorithms to be easily refactored to use 
H2O in place of (or along with), say, Hadoop/MR and/or DistributedRowMatrix etc.

@Anand: Mahout does not "depend on Spark" at the Matrix and Vector API level. 
Instead, Spark is integrated at the physical plan operator layer and through a 
completely separate logical-layer matrix representation (DrmLike, etc.), in 
order to cleanly separate the "shared mem" and "shared nothing" use cases. And 
of course, no actual Spark dependencies ever seep into the mahout-math module. 
We actually spent a lot of effort to unmarry that module from even Hadoop 
dependencies, IIRC. I expect it to stay that way.

The o.a.m.math.Matrix and Vector APIs are reserved for in-core operations only, 
and all algorithms around them are built assuming a "shared memory" model (i.e. 
they don't see it as a problem to iterate over all non-zeros in a single 
thread). Dumping the "shared nothing" and "shared mem" use cases into a single 
API makes no sense, in my not so humble opinion (unless the proposal is to work 
towards "unholy mess" architectural standards).

This would confuse devs to no end. No algorithm, IMO, can be written to be 
completely agnostic of "shared-mem" vs. "shared-nothing" issues. I.e. 
distributed functional code will of course be able to run on a single machine, 
but that simply amounts to "write everything as if it were distributed, using 
FP", so this is not the answer.
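
To make the distinction concrete, here is a minimal Java sketch of the two API shapes. Every name in it (SharedMemVsSharedNothing, InCoreVector, DistributedRows, LocalPartitions, and so on) is hypothetical and made up for illustration; none of this is the actual o.a.m.math or DrmLike API. The in-core type allows element access from a plain loop, while the distributed type deliberately offers only a per-partition function plus a combiner, with local lists standing in for cluster partitions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Illustration only: all names here are hypothetical, not Mahout APIs.
final class SharedMemVsSharedNothing {

    /** In-core ("shared memory") vector: element access from a single thread is fine. */
    interface InCoreVector {
        int size();
        double get(int i);
    }

    static final class DenseInCoreVector implements InCoreVector {
        private final double[] values;
        DenseInCoreVector(double... values) { this.values = values.clone(); }
        public int size() { return values.length; }
        public double get(int i) { return values[i]; }
    }

    /** Typical in-core algorithm: a plain single-threaded loop over all elements. */
    static double sumInCore(InCoreVector v) {
        double sum = 0.0;
        for (int i = 0; i < v.size(); i++) {
            sum += v.get(i);
        }
        return sum;
    }

    /**
     * Distributed ("shared nothing") rows: no get(i) and no iterator on purpose.
     * The only way in is a per-partition function plus an associative combiner.
     */
    interface DistributedRows {
        <R> R aggregate(Function<List<InCoreVector>, R> perPartition, BinaryOperator<R> combine);
    }

    /** Stand-in for a cluster: each "partition" is just a local list. */
    static final class LocalPartitions implements DistributedRows {
        private final List<List<InCoreVector>> partitions;
        LocalPartitions(List<List<InCoreVector>> partitions) { this.partitions = partitions; }

        public <R> R aggregate(Function<List<InCoreVector>, R> perPartition, BinaryOperator<R> combine) {
            R result = null; // stays null if there are no partitions
            for (List<InCoreVector> partition : partitions) {
                R partial = perPartition.apply(partition);
                result = (result == null) ? partial : combine.apply(result, partial);
            }
            return result;
        }
    }

    /** The same "sum" written against the distributed API: in-core code runs inside each partition. */
    static double sumDistributed(DistributedRows rows) {
        return rows.<Double>aggregate(
            partition -> {
                double s = 0.0;
                for (InCoreVector v : partition) {
                    s += sumInCore(v);
                }
                return s;
            },
            Double::sum);
    }

    /** Two partitions holding the rows [1, 2], [3] and [4, 5, 6]. */
    static DistributedRows twoPartitionExample() {
        List<InCoreVector> p1 = Arrays.<InCoreVector>asList(
            new DenseInCoreVector(1.0, 2.0), new DenseInCoreVector(3.0));
        List<InCoreVector> p2 = Arrays.<InCoreVector>asList(
            new DenseInCoreVector(4.0, 5.0, 6.0));
        return new LocalPartitions(Arrays.asList(p1, p2));
    }
}
```

Note that sumInCore cannot be pointed at a DistributedRows without being restructured into the per-partition and combine steps, which is the sense in which an algorithm is never fully agnostic of the two models.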

So -1 on this. This is not nearly the same as how Spark was integrated.

My suggestion is either to integrate with the linear algebra optimizer at the 
physical layer (which seems quite impossible to me today because of the h2o 
programming model) or, failing that, to start on a completely separate, 
yet-another set of "shared-nothing" APIs, just as was done for Spark. Of 
course, we'd be incoherent here once again, which is why I would not like even 
this; it might as well be a happily standalone or contrib project with no 
common parts.

Messing with the Job API is less objectionable, I guess, since Job is a 
shared-nothing API to begin with; however, you are providing too few details to 
form a sensible opinion on this, so -0 on this at this point.

> H2O integration
> ---------------
>
>                 Key: MAHOUT-1500
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1500
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Anand Avati
>             Fix For: 1.0
>
>
> Integration with h2o (github.com/0xdata/h2o) in order to exploit its high 
> performance computational abilities.
> Start with providing implementations of AbstractMatrix and AbstractVector, 
> and more as we make progress.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
