[
https://issues.apache.org/jira/browse/MAHOUT-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13982481#comment-13982481
]
Dmitriy Lyubimov commented on MAHOUT-1500:
------------------------------------------
bq. The rationale for doing the work externally is largely the non-technical
opposition from Dmitriy.
I am not sure what is non-technical in my previous post, or in pretty much any
post attached to this jira on my behalf.
I am glad some github code is finally officially confirmed to be tied to this
very M-1500 issue for the first time.
However, I very much don't want to get pulled into a discussion measuring the
height of anyone's moral ground here. Which is why this is the last time I post
on this issue: it has obviously become pretty toxic for me to touch, since the
desire to discredit my position by spin has become so palpable.
I have weighed the technical merit of the arguments given to me so far,
privately and publicly, while consciously pushing my objectivity levers to
their extreme "max" position; unfortunately, I don't think I found enough
substance in them to overcome the problems I have already reported. *But this
is just a matter of opinion, and I have already given a 0 vote on this. So I
don't see why you would want to do anything different w.r.t. submitting this
work for further review with people on this forum* based solely on my arguments
-- even if I have been privy to some additional information about this
development before it was announced. I am not significant to the progress of
this work. My arguments might be of some value, though.
So, for the last time, here is a recap of what it was.
*(A) critique of the idea of having anything blockwise-distributed under Matrix
api as it exists today*.
As I mentioned above, the x2o-matrix code itself refers to the core contracts
as a "performance bug" (here meaning the in-core abstraction of element-wise
direct access, the element-wise and vector-wise iterators, and the in-core
optimizer-specific contracts). If an implementation cannot satisfy the core
contracts of an abstraction, it follows directly that the abstraction is not
useful for that implementation. In other words, if algorithms using the
abstraction need to pay attention to which implementation class actually lies
underneath, then again, the abstraction has failed by definition.
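To make the failure mode concrete, here is a minimal mock-up (hypothetical classes, not the actual Mahout or h2o types) of what happens when code written against the in-core contract runs over a distributed backing:

```java
// A minimal mock-up (hypothetical classes, not the actual Mahout or h2o types)
// of the contract failure: callers of the in-core abstraction assume
// getQuick(row, col) is cheap direct access, but a distributed backing pays a
// (simulated) remote round-trip for every single element.
public class AbstractionCost {

    // the "95% of Mahout code" pattern: generic element-wise traversal
    static double sum(SimpleMatrix m) {
        double s = 0;
        for (int r = 0; r < m.rowSize(); r++)
            for (int c = 0; c < m.columnSize(); c++)
                s += m.getQuick(r, c);
        return s;
    }

    public static void main(String[] args) {
        DistributedBackedMatrix m =
            new DistributedBackedMatrix(new double[][]{{1, 2}, {3, 4}});
        System.out.println(sum(m));          // 10.0 -- correct result...
        System.out.println(m.remoteFetches); // 4 -- ...at one round-trip per element
    }
}

interface SimpleMatrix {
    int rowSize();
    int columnSize();
    double getQuick(int row, int col); // contract: O(1) direct access
}

class DistributedBackedMatrix implements SimpleMatrix {
    private final double[][] remoteBlocks; // stands in for data on other nodes
    int remoteFetches = 0;                 // counts simulated network round-trips

    DistributedBackedMatrix(double[][] data) { remoteBlocks = data; }
    public int rowSize() { return remoteBlocks.length; }
    public int columnSize() { return remoteBlocks[0].length; }
    public double getQuick(int row, int col) {
        remoteFetches++; // every element access would be a remote call
        return remoteBlocks[row][col];
    }
}
```

The generic caller still computes the right answer, but its cost model is silently broken -- which is exactly what "paying attention to the implementation class underneath" means.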
Concerns like that can be allayed in some (not common) cases by declaring
operations optionally supported (e.g. as in ByteBuffer#array()). However, in
such situations the optional contract is planned in from the very start rather
than introduced by later alteration, which would likely break existing users of
the abstraction. Optional contracts also cannot cover contracts as numerous and
as central as this "performance bug" qualifier suggests (like I said, some 95%
of current Mahout code uses element-wise or vector-wise iterators wherever a
Matrix or Vector type is involved). So I don't consider declaring optional
support for that family of in-core Matrix and Vector contracts a
reconciliation path for this design problem.
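For reference, the ByteBuffer precedent mentioned above looks like this in the JDK -- the optional backing-array operation ships with a probe method from day one:

```java
import java.nio.ByteBuffer;

public class OptionalContract {
    public static void main(String[] args) {
        ByteBuffer heap = ByteBuffer.allocate(8);         // heap buffer: backed by an array
        ByteBuffer direct = ByteBuffer.allocateDirect(8); // direct buffer: typically not

        System.out.println(heap.hasArray());   // true
        System.out.println(direct.hasArray()); // false on standard JDKs
        // Calling direct.array() here would throw UnsupportedOperationException:
        // callers are expected to probe hasArray() first. The optionality was
        // designed into the contract up front, not retrofitted onto it.
    }
}
```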
And I haven't heard a solid technical rebuttal to this from an OOA point of
view that would somehow vindicate this design in my mind.
*End-of-critique. Alternatives*
*(B)* Alternatively, suppose we really wanted to go this way (i.e. marry
something like an "h2o-ized variation of DistributedRowMatrix" with
AbstractMatrix using common mix-ins). Then a solid design would ideally imply
reworking the Matrix APIs in order to split them into finer classes of concerns
than exist today: algebraic ops, in-core optimizer ops, and element-wise access
concerns for the in-core and distributed models (i.e. stuff like getQuick,
setQuick and Iterable vs. mapBlock).
We would then say that we have some mix-in (interface) that addresses all
algebraic ops regardless of whether the backing is distributed or in-core.
This sounds kind of right, doesn't it?
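Such a split might be sketched as follows (illustrative interface and class names only; these are not Mahout's actual APIs):

```java
// Hypothetical sketch of the finer-grained split: algebraic ops live in one
// mix-in, in-core element-wise access in another, so a distributed type could
// implement the former without pretending to support the latter.
public class MixinSplit {
    public static void main(String[] args) {
        TinyDense a = new TinyDense(new double[][]{{1, 2}, {3, 4}});
        TinyDense b = new TinyDense(new double[][]{{10, 20}, {30, 40}});
        TinyDense c = a.plus(b);              // shared algebraic mix-in
        System.out.println(c.getQuick(1, 1)); // in-core-only access: 44.0
    }
}

interface AlgebraicOps<M> {   // implementable by in-core AND distributed types
    M plus(M other);
}

interface ElementWiseAccess { // in-core types only
    double getQuick(int row, int col);
    void setQuick(int row, int col, double v);
}

// an in-core matrix implements both concerns...
class TinyDense implements AlgebraicOps<TinyDense>, ElementWiseAccess {
    private final double[][] v;
    TinyDense(double[][] v) { this.v = v; }
    public TinyDense plus(TinyDense o) {
        double[][] r = new double[v.length][v[0].length];
        for (int i = 0; i < v.length; i++)
            for (int j = 0; j < v[0].length; j++)
                r[i][j] = v[i][j] + o.v[i][j];
        return new TinyDense(r);
    }
    public double getQuick(int r, int c) { return v[r][c]; }
    public void setQuick(int r, int c, double x) { v[r][c] = x; }
}
// ...while a distributed matrix would implement only AlgebraicOps, exposing
// its data block-wise (something like mapBlock) instead of getQuick/setQuick.
```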
However, this brings us back to the issue of destabilizing the in-core Matrix
API, splitting interfaces into hairs, and hence sending ripple effects of
refactoring throughout, and perhaps even beyond, the Mahout codebase.
In my opinion this cost is not sufficiently outweighed by the benefit of having
some common algebraic mix-ins shared between distributed and in-core stuff.
Instead, an algebraic operator-centric approach has in my experience turned out
much cleaner pragmatically from the distributed optimizer's point of view, and
has resulted in a much cleaner separation of in-core and distributed math
concerns even in the end-user algorithms.
Further on, even the purely algebraic stuff is unlikely to be totally common
(e.g. slice operators for vectors and elements are not supported on the
distributed side; instead, the mapBlock operator is implied there to get access
to the in-core iterators of the blocks; in-place operators are generally bad
for distributed plans too). This means an even further split of an API that at
first seemed fairly the same for in-core and distributed stuff. That's my
pragmatic net takeaway from the Spark bindings work.
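To illustrate the block-wise access pattern, here is a toy mock-up in Java (the real operator is the mapBlock of Mahout's Scala DSL; these types are invented for illustration):

```java
// Toy mock-up (hypothetical types) of block-wise access replacing element and
// slice access on the distributed side: user code never touches single cells
// of the distributed matrix, only whole in-core blocks handed to a function.
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

public class MapBlockDemo {
    public static void main(String[] args) {
        // two "partitions", each holding one in-core block of rows
        ToyDrm a = new ToyDrm(Arrays.asList(
            new double[][]{{1, 2}}, new double[][]{{3, 4}}));

        // e.g. scale every element by 2, one whole block at a time
        ToyDrm b = a.mapBlock(block -> {
            double[][] out = new double[block.length][];
            for (int r = 0; r < block.length; r++) {
                out[r] = block[r].clone();
                for (int c = 0; c < out[r].length; c++) out[r][c] *= 2;
            }
            return out;
        });

        System.out.println(b.blocks.get(1)[0][1]); // 8.0
    }
}

class ToyDrm {
    final List<double[][]> blocks; // stands in for distributed partitions

    ToyDrm(List<double[][]> blocks) { this.blocks = blocks; }

    // the only access path: map a function over whole in-core blocks
    ToyDrm mapBlock(UnaryOperator<double[][]> fn) {
        return new ToyDrm(blocks.stream().map(fn).collect(Collectors.toList()));
    }
}
```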
*(C)* Another angle of attack on x2o integration, IMO, would be plugging x2o
engines into the optimizer, which this work (M-1500) doesn't target. I rate
the possibility of this happening as quite tepid at the moment, because the
x2o programming model is not rich enough to provide things like zipping
identically distributed datasets, a very general shuffle model (e.g.
many-to-many shuffle), advanced partition management (shuffle-less
resplit/coalesce), and so on. I am not even sure there is a clear concept of a
combiner-type operation. That observation leaves very bleak prospects for a
physical-layer realization of the DRMLike Scala stuff using H2O.
So when [~tdunning] speaks of DSL integration, he most probably means the Scala
bindings, not the distributed DSL bindings. This would create further
fragmentation of approaches and goes against the "write once, run anywhere"
concept there. More likely, with this approach there would be "write once for
H2O" and "write once for everything else". Which is not the end of the world,
but it doesn't sound appealing, and it certainly doesn't seem to imply a
coherent H2O integration -- not coherent with the distributed algebra bindings,
anyway.
*(D)* And yet a third thought I probably have not yet stated in this jira: I
think the best path to any sort of benefit from x2o integration would be
borrowing its compression techniques for columnar in-core data frame blocks;
that is where x2o's strength is said to be, above anything else. But my
understanding is that at this point no one has any intention to work this angle
either.
I am not supportive of A and B, as explained.
I am dubious about alternative C, but I am not sufficiently qualified to judge
it.
I am supportive of alternative D.
Thank you for reading till the end.
-d
> H2O integration
> ---------------
>
> Key: MAHOUT-1500
> URL: https://issues.apache.org/jira/browse/MAHOUT-1500
> Project: Mahout
> Issue Type: Improvement
> Reporter: Anand Avati
> Fix For: 1.0
>
>
> Integration with h2o (github.com/0xdata/h2o) in order to exploit its high
> performance computational abilities.
> Start with providing implementations of AbstractMatrix and AbstractVector,
> and more as we make progress.
--
This message was sent by Atlassian JIRA
(v6.2#6252)