[jira] [Commented] (MAHOUT-1500) H2O integration

Dmitriy Lyubimov (JIRA) Tue, 01 Apr 2014 18:10:25 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957216#comment-13957216
 ]


Dmitriy Lyubimov commented on MAHOUT-1500:
------------------------------------------

@Anand, bottom line: the core of AbstractMatrix and Vector is element-wise 
iterators and direct element accessors. Lacking a distributed programming 
model, they don't work for the distributed stuff. 

There are two ways to go with such an approach. One is to declare the core 
abstractions unsupported in the distributed implementation, which just proves 
that AbstractMatrix and Vector are not good abstractions for that work (why 
would one need an abstraction if its major, core contracts are all of a sudden 
declared optional or deprecated?). 

Truth be told, there is some Matrix API that uses FP; the two major pieces are 
aggregate() and assign(). However, this still doesn't get us anywhere, in the 
sense that we should support _all_ core contracts, not just assign() and 
aggregate().
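
To make the distinction concrete, here is a minimal sketch in plain Java of 
why closure-style calls such as assign() and aggregate() can be pushed down 
to partitions while element accessors cannot. Everything in it is an 
assumption made for illustration: PartitionedMatrix is a hypothetical 
stand-in, not a Mahout or H2O class, the "partitions" are just local arrays, 
and the explicit identity argument to aggregate() is a simplification of 
this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.DoubleBinaryOperator;
import java.util.function.DoubleUnaryOperator;

// Hypothetical stand-in for a distributed matrix: each double[][] block
// plays the role of a row partition that would live on a different node.
final class PartitionedMatrix {
    private final List<double[][]> partitions = new ArrayList<>();

    PartitionedMatrix addPartition(double[][] rows) {
        partitions.add(rows);
        return this;
    }

    // Functional contract: the function is handed to each partition and
    // applied locally. The caller never asks for element (i, j).
    PartitionedMatrix assign(DoubleUnaryOperator f) {
        for (double[][] block : partitions) {
            for (double[] row : block) {
                for (int j = 0; j < row.length; j++) {
                    row[j] = f.applyAsDouble(row[j]);
                }
            }
        }
        return this;
    }

    // Map every element, fold within each partition, then fold the partial
    // results. The combiner is assumed to be associative, with `identity`
    // as its neutral element.
    double aggregate(DoubleBinaryOperator combiner,
                     DoubleUnaryOperator mapper,
                     double identity) {
        double total = identity;
        for (double[][] block : partitions) {
            double partial = identity;
            for (double[] row : block) {
                for (double v : row) {
                    partial = combiner.applyAsDouble(partial, mapper.applyAsDouble(v));
                }
            }
            total = combiner.applyAsDouble(total, partial);
        }
        return total;
    }
}
```

A call such as m.assign(x -> x * 2) followed by 
m.aggregate(Double::sum, x -> x, 0.0) touches every partition without any 
per-element round trip. A get(row, col) accessor or an element iterator has 
no comparable way to be shipped to the data.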

The other way is to heavily refactor the core abstractions in favor of 
functional support, while deprecating or eliminating direct access. I call 
this the "nuclear option", because it sends ripple effects not only through 
Mahout but through any 3rd-party code that uses mahout-math (mine, 
specifically). It will force people to reconsider using Mahout because of 
stability issues in the areas where it was promised to be stable.

Extending the DistributedRowMatrix API: I am kind of dubious about that as 
well, since it is also unusable without a major FP infusion and, frankly, 
kind of ancient.

More likely, a completely new FP-laced distributed matrix representation is 
desired. SparkBindings went down that path and created an FP-laced DRM API. 
But this is entirely a Scala-side abstraction, with Scala function literals 
etc. So if you are looking to create a Java distributed matrix abstraction, 
this is not going to be useful at all either.

So, more likely, you need a completely new FP-oriented Java API interface, 
something like X2OMatrix.java. This will fragment the project even further 
but, all marketing fluff aside, that's the only realistic option I see that 
might work. 
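
As a rough idea of what such an interface could look like, here is a sketch 
under stated assumptions. FunctionalMatrix and LocalFunctionalMatrix are 
hypothetical names invented for this example; they are not the X2OMatrix 
mentioned above and not any existing Mahout or H2O type. The point is only 
the shape of the contract: functions go to the data, and materialization on 
the caller's side is an explicit step.

```java
import java.util.Arrays;
import java.util.function.DoubleBinaryOperator;
import java.util.function.UnaryOperator;

// Hypothetical FP-oriented contract. Note what is absent: no get(i, j),
// no set(i, j, v), and no element iterator.
interface FunctionalMatrix {
    int numRows();

    int numCols();

    // Apply a function to every row; returns a new logical matrix.
    FunctionalMatrix mapRows(UnaryOperator<double[]> rowFn);

    // Reduce all elements with an associative combiner.
    double reduce(double identity, DoubleBinaryOperator combiner);

    // Explicitly materialize on the caller's side (only safe when small).
    double[][] collect();
}

// Single-process reference implementation, for illustration only.
final class LocalFunctionalMatrix implements FunctionalMatrix {
    private final double[][] rows;

    LocalFunctionalMatrix(double[][] rows) {
        this.rows = rows;
    }

    public int numRows() {
        return rows.length;
    }

    public int numCols() {
        return rows.length == 0 ? 0 : rows[0].length;
    }

    public FunctionalMatrix mapRows(UnaryOperator<double[]> rowFn) {
        double[][] out = new double[rows.length][];
        for (int i = 0; i < rows.length; i++) {
            // Work on a copy so the source matrix is left unchanged.
            out[i] = rowFn.apply(rows[i].clone());
        }
        return new LocalFunctionalMatrix(out);
    }

    public double reduce(double identity, DoubleBinaryOperator combiner) {
        return Arrays.stream(rows)
                .flatMapToDouble(Arrays::stream)
                .reduce(identity, combiner);
    }

    public double[][] collect() {
        double[][] copy = new double[rows.length][];
        for (int i = 0; i < rows.length; i++) {
            copy[i] = rows[i].clone();
        }
        return copy;
    }
}
```

A distributed backend would implement the same interface by running rowFn 
and the combiner next to each partition. Existing mahout-math code written 
against element accessors would not run against it unchanged, which is the 
fragmentation cost noted above.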

I would also question (kinda) the wisdom of a standalone distributed vector 
abstraction. On the Hadoop side and the Spark side this abstraction is 
completely bypassed (it is assumed that a real vector will always fit into a 
single machine's memory). In situations where a vector might be formed as the 
result of a distributed operation (e.g. A %*% x), the result is simply a 
distributed single-column matrix, from which the column can always be 
collected on the front end via the collection/slicing API. 
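
The A %*% x case can be sketched in a few lines of plain Java. This is only 
an illustration under assumptions: RowBlocks, timesVector() and 
collectColumn() are hypothetical names, not the Hadoop or Spark-side API, 
and the blocks are local arrays standing in for partitions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical row-partitioned matrix; each block stands in for the rows
// held by one node.
final class RowBlocks {
    private final List<double[][]> blocks = new ArrayList<>();

    RowBlocks add(double[][] block) {
        blocks.add(block);
        return this;
    }

    // A %*% x: x is small and in-core, so it is simply handed to every
    // block. The result keeps the same partitioning and has one column.
    RowBlocks timesVector(double[] x) {
        RowBlocks result = new RowBlocks();
        for (double[][] block : blocks) {
            double[][] out = new double[block.length][1];
            for (int i = 0; i < block.length; i++) {
                double dot = 0.0;
                for (int j = 0; j < x.length; j++) {
                    dot += block[i][j] * x[j];
                }
                out[i][0] = dot;
            }
            result.add(out);
        }
        return result;
    }

    // Front-end step: gather one column into an ordinary in-memory array.
    double[] collectColumn(int col) {
        int n = 0;
        for (double[][] block : blocks) {
            n += block.length;
        }
        double[] v = new double[n];
        int k = 0;
        for (double[][] block : blocks) {
            for (double[] row : block) {
                v[k++] = row[col];
            }
        }
        return v;
    }
}
```

No distributed vector type appears anywhere: the input x and the collected 
result are ordinary arrays, and the only distributed object is a matrix that 
happens to have a single column.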

 

> H2O integration
> ---------------
>
>                 Key: MAHOUT-1500
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1500
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Anand Avati
>             Fix For: 1.0
>
>
> Integration with h2o (github.com/0xdata/h2o) in order to exploit its high 
> performance computational abilities.
> Start with providing implementations of AbstractMatrix and AbstractVector, 
> and more as we make progress.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
