[
https://issues.apache.org/jira/browse/MAHOUT-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13982481#comment-13982481
]
Dmitriy Lyubimov commented on MAHOUT-1500:
------------------------------------------
bq. The rationale for doing the work externally is largely the non-technical
opposition from Dmitriy.
I am not sure what is non-technical in my previous post, or in pretty much any
post attached to this jira on my behalf.
I am glad some github code is finally officially confirmed to be tied to this
very M-1500 issue for the first time.
However, I very much don't want to get pulled into a discussion measuring the
height of anyone's moral ground here. Which is why this is the last time I post
on this issue: it has obviously become pretty toxic for me to touch, since the
desire to discredit my position by spin has become so palpable.
I have weighed the technical merit of the arguments given to me so far,
privately and publicly, while consciously pushing my objectivity levers to
their extreme "max" position; unfortunately, I don't think I found enough
substance in them to overcome the problems I have already reported. *But this
is just a matter of opinion, and I have already given a 0 vote on this. So I
don't see why you would want to do anything different w.r.t. submitting this
work for further review with people on this forum* based solely on my arguments
-- even if I have been privy to some additional information about this
development before it was announced. I am not significant to the progress of
this work. My arguments might be of some value, though.
So, for the last time, here is a recap of what it was.
*(A) critique of the idea of having anything blockwise-distributed under Matrix
api as it exists today*.
As I mentioned above, the x2o-matrix code itself refers to the core contracts
as a "performance bug" (here meaning the in-core abstraction of element-wise
direct access, the element-wise and vector-wise iterators, and the in-core
optimizer-specific contracts). If an implementation cannot satisfy the core
contracts of an abstraction, it follows directly that the abstraction is not
useful for that implementation. In other words, if algorithms using the
abstraction need to pay attention to which implementation class actually lies
underneath, then again, the abstraction has failed by definition.
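To make the failure mode concrete, here is a minimal mock-up (hypothetical classes, not the actual Mahout or h2o types) of what happens when code written against the in-core contract runs over a distributed backing:

```java
// A minimal mock-up (hypothetical classes, not the actual Mahout or h2o types)
// of the contract failure: callers of the in-core abstraction assume
// getQuick(row, col) is cheap direct access, but a distributed backing pays a
// (simulated) remote round-trip for every single element.
public class AbstractionCost {

    // the "95% of Mahout code" pattern: generic element-wise traversal
    static double sum(SimpleMatrix m) {
        double s = 0;
        for (int r = 0; r < m.rowSize(); r++)
            for (int c = 0; c < m.columnSize(); c++)
                s += m.getQuick(r, c);
        return s;
    }

    public static void main(String[] args) {
        DistributedBackedMatrix m =
            new DistributedBackedMatrix(new double[][]{{1, 2}, {3, 4}});
        System.out.println(sum(m));          // 10.0 -- correct result...
        System.out.println(m.remoteFetches); // 4 -- ...at one round-trip per element
    }
}

interface SimpleMatrix {
    int rowSize();
    int columnSize();
    double getQuick(int row, int col); // contract: O(1) direct access
}

class DistributedBackedMatrix implements SimpleMatrix {
    private final double[][] remoteBlocks; // stands in for data on other nodes
    int remoteFetches = 0;                 // counts simulated network round-trips

    DistributedBackedMatrix(double[][] data) { remoteBlocks = data; }
    public int rowSize() { return remoteBlocks.length; }
    public int columnSize() { return remoteBlocks[0].length; }
    public double getQuick(int row, int col) {
        remoteFetches++; // every element access would be a remote call
        return remoteBlocks[row][col];
    }
}
```

The generic caller still computes the right answer, but its cost model is silently broken -- which is exactly what "paying attention to the implementation class underneath" means.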
Concerns like that can be allayed in some (not common) cases by declaring
operations optionally supported (e.g. as in ByteBuffer#array()). However, in
such situations the optional contract is planned in from the very start rather
than introduced by later alteration, which would likely break existing users of
the abstraction. Optional contracts also cannot cover contracts as numerous and
as central as this "performance bug" qualifier suggests (like I said, some 95%
of current Mahout code uses element-wise or vector-wise iterators wherever a
Matrix or Vector type is involved). So I don't consider declaring optional
support for that family of in-core Matrix and Vector contracts a
reconciliation path for this design problem.
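For reference, the ByteBuffer precedent mentioned above looks like this in the JDK -- the optional backing-array operation ships with a probe method from day one:

```java
import java.nio.ByteBuffer;

public class OptionalContract {
    public static void main(String[] args) {
        ByteBuffer heap = ByteBuffer.allocate(8);         // heap buffer: backed by an array
        ByteBuffer direct = ByteBuffer.allocateDirect(8); // direct buffer: typically not

        System.out.println(heap.hasArray());   // true
        System.out.println(direct.hasArray()); // false on standard JDKs
        // Calling direct.array() here would throw UnsupportedOperationException:
        // callers are expected to probe hasArray() first. The optionality was
        // designed into the contract up front, not retrofitted onto it.
    }
}
```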
And I haven't heard a solid technical rebuttal to this from an OOA point of
view that would somehow vindicate this design in my mind.
*End-of-critique. Alternatives*
*(B)* Alternatively, suppose we really wanted to go this way (i.e. marry
something like an "h2o-ized variation of DistributedRowMatrix" with
AbstractMatrix using common mix-ins). Then a solid design would ideally imply
reworking the Matrix APIs in order to split them into finer classes of concerns
than exist today: algebraic ops, in-core optimizer ops, and element-wise access
concerns for the in-core and distributed models (i.e. stuff like getQuick,
setQuick and Iterable vs. mapBlock).
We would then say that we have some mix-in (interface) that addresses all
algebraic ops regardless of whether the backing is distributed or in-core.
This sounds kind of right, doesn't it?
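Such a split might be sketched as follows (illustrative interface and class names only; these are not Mahout's actual APIs):

```java
// Hypothetical sketch of the finer-grained split: algebraic ops live in one
// mix-in, in-core element-wise access in another, so a distributed type could
// implement the former without pretending to support the latter.
public class MixinSplit {
    public static void main(String[] args) {
        TinyDense a = new TinyDense(new double[][]{{1, 2}, {3, 4}});
        TinyDense b = new TinyDense(new double[][]{{10, 20}, {30, 40}});
        TinyDense c = a.plus(b);              // shared algebraic mix-in
        System.out.println(c.getQuick(1, 1)); // in-core-only access: 44.0
    }
}

interface AlgebraicOps<M> {   // implementable by in-core AND distributed types
    M plus(M other);
}

interface ElementWiseAccess { // in-core types only
    double getQuick(int row, int col);
    void setQuick(int row, int col, double v);
}

// an in-core matrix implements both concerns...
class TinyDense implements AlgebraicOps<TinyDense>, ElementWiseAccess {
    private final double[][] v;
    TinyDense(double[][] v) { this.v = v; }
    public TinyDense plus(TinyDense o) {
        double[][] r = new double[v.length][v[0].length];
        for (int i = 0; i < v.length; i++)
            for (int j = 0; j < v[0].length; j++)
                r[i][j] = v[i][j] + o.v[i][j];
        return new TinyDense(r);
    }
    public double getQuick(int r, int c) { return v[r][c]; }
    public void setQuick(int r, int c, double x) { v[r][c] = x; }
}
// ...while a distributed matrix would implement only AlgebraicOps, exposing
// its data block-wise (something like mapBlock) instead of getQuick/setQuick.
```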
However, this brings us back to the issue of destabilizing the in-core Matrix
API, splitting interfaces into hairs, and hence sending ripple effects of
refactoring throughout, and perhaps even beyond, the Mahout codebase.
In my opinion this cost is not sufficiently outweighed by the benefit of having
some common algebraic mix-ins shared between distributed and in-core stuff.
Instead, an algebraic operator-centric approach has in my experience turned out
much cleaner pragmatically from the distributed optimizer's point of view, and
has resulted in a much cleaner separation of in-core and distributed math
concerns even in the end-user algorithms.
Further on, even the purely algebraic stuff is unlikely to be totally common
(e.g. slice operators for vectors and elements are not supported on the
distributed side; instead, the mapBlock operator is implied there to get access
to the in-core iterators of the blocks; in-place operators are generally bad
for distributed plans too). This means an even further split of an API that at
first seemed fairly the same for in-core and distributed stuff. That's my
pragmatic net takeaway from the Spark bindings work.
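To illustrate the block-wise access pattern, here is a toy mock-up in Java (the real operator is the mapBlock of Mahout's Scala DSL; these types are invented for illustration):

```java
// Toy mock-up (hypothetical types) of block-wise access replacing element and
// slice access on the distributed side: user code never touches single cells
// of the distributed matrix, only whole in-core blocks handed to a function.
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

public class MapBlockDemo {
    public static void main(String[] args) {
        // two "partitions", each holding one in-core block of rows
        ToyDrm a = new ToyDrm(Arrays.asList(
            new double[][]{{1, 2}}, new double[][]{{3, 4}}));

        // e.g. scale every element by 2, one whole block at a time
        ToyDrm b = a.mapBlock(block -> {
            double[][] out = new double[block.length][];
            for (int r = 0; r < block.length; r++) {
                out[r] = block[r].clone();
                for (int c = 0; c < out[r].length; c++) out[r][c] *= 2;
            }
            return out;
        });

        System.out.println(b.blocks.get(1)[0][1]); // 8.0
    }
}

class ToyDrm {
    final List<double[][]> blocks; // stands in for distributed partitions

    ToyDrm(List<double[][]> blocks) { this.blocks = blocks; }

    // the only access path: map a function over whole in-core blocks
    ToyDrm mapBlock(UnaryOperator<double[][]> fn) {
        return new ToyDrm(blocks.stream().map(fn).collect(Collectors.toList()));
    }
}
```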
*(C)* Another angle of attack on x2o integration, IMO, would be plugging x2o
engines into the optimizer, which this work (M-1500) doesn't target. I rate
the possibility of this happening as quite tepid at the moment, because the
x2o programming model is not rich enough to provide things like zipping
identically distributed datasets, a very general shuffle model (e.g.
many-to-many shuffle), advanced partition management (shuffle-less
resplit/coalesce), and so on. I am not even sure there is a clear concept of a
combiner-type operation. That observation leaves very bleak prospects for a
physical-layer realization of the DRMLike Scala stuff using H2O.
So when [~tdunning] speaks of DSL integration, he most probably means the Scala
bindings, not the distributed DSL bindings. This would create further
fragmentation of approaches and goes against the "write once, run anywhere"
concept there. More likely, with this approach there would be "write once for
H2O" and "write once for everything else". Which is not the end of the world,
but it doesn't sound appealing, and it certainly doesn't seem to imply a
coherent H2O integration -- not coherent with the distributed algebra bindings,
anyway.
*(D)* And yet a third thought I probably have not yet stated in this jira: I
think the best path to any sort of benefit from x2o integration would be
borrowing its compression techniques for columnar in-core data frame blocks;
that is where x2o's strength is said to be, above anything else. But my
understanding is that at this point no one has any intention to work this angle
either.
I am not supportive of A and B, as explained.
I am dubious about alternative C, but I am not sufficiently qualified to judge
it.
I am supportive of alternative D.
Thank you for reading till the end.
-d
> H2O integration
> ---------------
>
> Key: MAHOUT-1500
> URL: https://issues.apache.org/jira/browse/MAHOUT-1500
> Project: Mahout
> Issue Type: Improvement
> Reporter: Anand Avati
> Fix For: 1.0
>
>
> Integration with h2o (github.com/0xdata/h2o) in order to exploit its high
> performance computational abilities.
> Start with providing implementations of AbstractMatrix and AbstractVector,
> and more as we make progress.
--
This message was sent by Atlassian JIRA
(v6.2#6252)