[
https://issues.apache.org/jira/browse/MAHOUT-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13982497#comment-13982497
]
Ted Dunning commented on MAHOUT-1500:
-------------------------------------
[[email protected]]'s comments contain several incorrect statements that lead to
incorrect conclusions.
Some of these statements are explicit and some are implicit. In paraphrased
form, they include:
* A comment about a "performance bug" means that h2o can't implement the Matrix
API
What that comment actually means is that the use of some operations may have
performance impacts that some programmers will find surprisingly large. The
comment is intended to
warn implementors that these impacts could be large enough to essentially
prevent benefit from parallel computation. As such, their use would thwart
some of the purpose of using a parallel system. The reference to a
"performance bug" does not imply that the operations do not work and, indeed,
their availability might be handy during initial implementation of algorithms.
Section (A) questions the validity of the abstractions on the grounds that
existing code would have to be modified, but that doesn't apply, since modifying
existing code isn't the purpose of the current work.
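To make the "performance bug" point concrete, here is a minimal sketch in plain
Java. It assumes a hypothetical chunked vector (ChunkedVectorSketch) with a
simulated fetch counter. None of these names come from Mahout or h2o, and the
counter only stands in for remote round trips. Element-wise access and the bulk
reduction return the same answer, but the element-wise path pays one simulated
round trip per element rather than one per chunk.

```java
import java.util.Arrays;

/** Hypothetical chunked vector for illustration only; not Mahout or h2o code. */
final class ChunkedVectorSketch {
    private final double[][] chunks;    // each inner array stands in for data held on one node
    private final int chunkSize;
    private final int length;
    private long simulatedFetches = 0;  // counts simulated remote round trips

    ChunkedVectorSketch(double[] data, int chunkSize) {
        this.chunkSize = chunkSize;
        this.length = data.length;
        int n = (data.length + chunkSize - 1) / chunkSize;
        chunks = new double[n][];
        for (int c = 0; c < n; c++) {
            int from = c * chunkSize;
            int to = Math.min(data.length, from + chunkSize);
            chunks[c] = Arrays.copyOfRange(data, from, to);
        }
    }

    int size() { return length; }

    /** Element access: it works, but costs one simulated round trip per call. */
    double get(int i) {
        simulatedFetches++;
        return chunks[i / chunkSize][i % chunkSize];
    }

    /** Bulk reduction: one simulated round trip per chunk. */
    double sum() {
        double total = 0.0;
        for (double[] chunk : chunks) {
            simulatedFetches++;
            for (double v : chunk) total += v;
        }
        return total;
    }

    long fetches() { return simulatedFetches; }

    void resetFetches() { simulatedFetches = 0; }

    /** Same result as sum(), computed through the element-wise API. */
    static double sumElementwise(ChunkedVectorSketch v) {
        double total = 0.0;
        for (int i = 0; i < v.size(); i++) total += v.get(i);
        return total;
    }

    public static void main(String[] args) {
        double[] data = new double[1000];
        Arrays.fill(data, 1.0);
        ChunkedVectorSketch v = new ChunkedVectorSketch(data, 100);

        double slow = sumElementwise(v);
        long slowFetches = v.fetches();
        v.resetFetches();
        double fast = v.sum();
        long fastFetches = v.fetches();

        System.out.println("element-wise: sum=" + slow + " fetches=" + slowFetches);
        System.out.println("bulk:         sum=" + fast + " fetches=" + fastFetches);
    }
}
```

In this toy setup the element-wise path is correct and convenient for a first
implementation, which is the sense in which such operations can be available and
still be a performance hazard.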
* It is the intent of the h2o support of the Matrix API that all codes that use
the Matrix API should run and get parallel speedup
This is explicitly not a goal of the current effort. The goal of the current
effort is to use a well understood and stable Mahout API to experiment with
implementation techniques for parallel algorithms that are based on h2o. It is
a premise of this effort that the operations used in these hand-built
implementations will have execution patterns roughly similar to those of
equivalent programs that use the Scala bindings or the distributed DSL bindings.
That
premise is unlikely to be massively incorrect and thus the current effort is
useful in terms of determining good h2o idioms for implementing matrix code.
The pattern of usage of the matrix API by other Mahout codes is completely
irrelevant to this effort.
* The h2o system is not rich enough in capabilities to support things like
zipping identically distributed data sets.
This is simply incorrect and is based on a lack of knowledge of the h2o system.
The h2o primitives are different from Spark primitives. That means that
different idioms have to be used to generate similar results, but it doesn't
mean that h2o lacks these capabilities. In particular, the discord between
what [[email protected]] thinks that h2o can do and what it can do is large
enough that the entire section (C) in his comments is essentially vacuous since
it is based entirely on false premises.
The current results indicate that there is considerable promise for h2o in terms
of these capabilities. More work is indicated.
* The current work would require massive revamping of the current Mahout Matrix
API.
The current work is a technical exploration of convenient and efficient
implementation techniques. It has no implications whatsoever regarding the
refactoring of the Mahout Matrix API. The current work does have implications
relative to any h2o shim layers that might ultimately be necessary, but that
has nothing to do with the current Mahout in-core APIs. Section (B) is thus
also moot.
The emotional tenor of [[email protected]]'s comments is exactly what is
encouraging the h2o work to be done a bit apart. It simply isn't efficient to
have to answer so many off-topic points whenever any reports on work in
progress are given.
> H2O integration
> ---------------
>
> Key: MAHOUT-1500
> URL: https://issues.apache.org/jira/browse/MAHOUT-1500
> Project: Mahout
> Issue Type: Improvement
> Reporter: Anand Avati
> Fix For: 1.0
>
>
> Integration with h2o (github.com/0xdata/h2o) in order to exploit its high
> performance computational abilities.
> Start with providing implementations of AbstractMatrix and AbstractVector,
> and more as we make progress.
--
This message was sent by Atlassian JIRA
(v6.2#6252)