[
https://issues.apache.org/jira/browse/MAHOUT-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956246#comment-13956246
]
Dmitriy Lyubimov commented on MAHOUT-1500:
------------------------------------------
bq. Mahout will be "depending" on h2o (like how it "depends on" Spark), and there
will be enough infrastructure implementations (like Matrix, Vector, possibly
Job) which would allow for existing algorithms to be easily refactored to use
H2O in place of (or along with), say, Hadoop/MR and/or DistributedRowMatrix etc.
@Anand: Mahout does not "depend on Spark" at the Matrix and Vector API level.
Instead, the Spark integration lives at the physical plan operator layer,
behind a completely separate logical-layer matrix representation (DrmLike,
etc.), in order to cleanly separate the "shared mem" and "shared nothing" use
cases. And of course, no actual Spark dependencies ever seep into the
mahout-math module. We actually spent a lot of effort to unmarry that module
even from Hadoop dependencies, IIRC. I expect it to stay that way.
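
To illustrate the layering described above, here is a rough sketch in plain
Python. It is not Mahout code: every class, function and backend name below is
made up for illustration. The point is only that the logical expression knows
nothing about any engine, and engine-specific operators are picked in a
separate planning step.

```python
# Illustrative only: hypothetical names, not Mahout, Spark or H2O classes.
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """Logical leaf: a named distributed matrix."""
    name: str


@dataclass(frozen=True)
class Transpose:
    """Logical operator: transpose of the child expression."""
    child: object


@dataclass(frozen=True)
class Times:
    """Logical operator: matrix product of left and right."""
    left: object
    right: object


def plan(expr, backend):
    """Translate a logical expression into backend-specific operator names."""
    if isinstance(expr, Source):
        return f"{backend['load']}({expr.name})"
    if isinstance(expr, Transpose):
        return f"{backend['transpose']}({plan(expr.child, backend)})"
    if isinstance(expr, Times):
        left = plan(expr.left, backend)
        right = plan(expr.right, backend)
        return f"{backend['times']}({left}, {right})"
    raise TypeError(f"unknown logical operator: {expr!r}")


# Two hypothetical engines; only this mapping is engine-specific.
backend_a = {"load": "a_load", "transpose": "a_transpose", "times": "a_times"}
backend_b = {"load": "b_load", "transpose": "b_transpose", "times": "b_times"}

# The same logical expression (A' times A) planned for either engine.
ata = Times(Transpose(Source("A")), Source("A"))
plan_a = plan(ata, backend_a)
plan_b = plan(ata, backend_b)
```

In this sketch the logical classes never import anything engine-specific,
which is the property being defended for mahout-math.
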
The o.a.m.math.Matrix and Vector APIs are reserved for in-core operations only,
and all algorithms around them are built assuming a "shared memory" model (i.e.
they don't see it as a problem to iterate over all non-zeros in a single thread).
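
A rough sketch of the difference, in plain Python rather than anything from
Mahout (the data layout and all function names are assumptions made up for
illustration): the in-core version walks every non-zero in one loop, while the
"shared nothing" version has to be restructured into per-partition work plus
a merge step.

```python
# Illustrative only: plain Python, none of these names are Mahout APIs.
from functools import reduce

# A small sparse matrix stored as {row: {col: value}}.
matrix = {0: {0: 1.0, 2: 2.0}, 1: {1: 3.0}, 2: {0: 4.0, 2: 5.0}}


def col_sums_in_core(m, ncols):
    """Shared-memory style: a single thread iterates over all non-zeros."""
    sums = [0.0] * ncols
    for row in m.values():
        for col, value in row.items():
            sums[col] += value
    return sums


def partition_rows(m, nparts):
    """Hypothetical row partitioning; each part stands in for one worker."""
    parts = [{} for _ in range(nparts)]
    for r, row in m.items():
        parts[r % nparts][r] = row
    return parts


def col_sums_shared_nothing(parts, ncols):
    """Shared-nothing style: sum each part in isolation, then merge partials."""
    partials = [col_sums_in_core(p, ncols) for p in parts]  # no shared state
    return reduce(
        lambda a, b: [x + y for x, y in zip(a, b)],
        partials,
        [0.0] * ncols,
    )


in_core = col_sums_in_core(matrix, 3)
distributed = col_sums_shared_nothing(partition_rows(matrix, 2), 3)
```

Both calls give the same column sums here, but the second algorithm only got
there by being written around partitions and a merge from the start.
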
Dumping the "shared nothing" and "shared mem" use cases into a single API makes
no sense, in my not so humble opinion (unless the proposal is to work towards
"unholy mess" architectural standards).
This would confuse devs to no end. IMO no algorithm can be written to be
completely agnostic of "shared-mem" vs. "shared nothing" issues. Distributed
functional code will of course be able to run on a single machine, but that
simply amounts to the logic "write everything as if it were distributed, using
FP", so it is not the answer.
So -1 on this. This is not nearly the same as how Spark was integrated.
My suggestion is to either integrate with the linear algebra optimizer at the
physical layer (which seems quite impossible to me today because of the h2o
programming model) or, failing that, to start on a completely separate,
yet-another set of "shared-nothing" APIs, just as was done for Spark. Of
course, we'd be incoherent here once again, which is why I would not like even
this -- it might as well be a happily standalone or contrib project with no
common parts.
Messing with the Job API is less objectionable I guess, since Job is a
shared-nothing API to begin with; however, you are providing too few details to
form a sensible opinion on this, so -0 on this at this point.
> H2O integration
> ---------------
>
> Key: MAHOUT-1500
> URL: https://issues.apache.org/jira/browse/MAHOUT-1500
> Project: Mahout
> Issue Type: Improvement
> Reporter: Anand Avati
> Fix For: 1.0
>
>
> Integration with h2o (github.com/0xdata/h2o) in order to exploit its high
> performance computational abilities.
> Start with providing implementations of AbstractMatrix and AbstractVector,
> and more as we make progress.