Re: [jira] [Commented] (MAHOUT-1500) H2O integration

Frank Scholten Tue, 01 Apr 2014 04:20:08 -0700

I suggest creating a separate mahout-h2o module that shows a simple end-to-end 
example, including vectorizing. I had a look at the h2o API; it looks 
interesting, and I am curious to see how to vectorize data from different 
sources. We could start by taking an existing example, such as clustering 
Reuters.

I would not suggest immediately trying to extend existing Mahout APIs. I 
agree with Dmitriy that we shouldn't mix distributed and local code. After 
creating a few examples we can see where code can be reused and where the 
boundaries are. It also gives everyone a feel for the h2o API. Then we can 
extract common code.

I also like Anand's idea of creating an h2o alternative to a Hadoop job. I 
would like to see this implemented as a Java bean with a separate CLI driver 
class, so that it is easy to use from Java. Current Mahout jobs have to be 
called via main methods with String arrays. See lucene2seq for an example of 
the bean config idea.
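
To make the bean config idea concrete, here is a minimal sketch of what I have 
in mind. It is only an illustration: ClusteringJob, ClusteringJobDriver, the 
option names and the default cluster count are all made up for this mail and 
are not existing Mahout or h2o classes. The run() method is a placeholder that 
only validates the configuration and describes what would be executed.

```java
import java.util.Objects;

/** Hypothetical bean holding the job configuration; not an existing Mahout or h2o class. */
class ClusteringJob {
    private String inputPath;
    private String outputPath;
    private int numClusters = 20; // assumed default, for illustration only

    public String getInputPath() { return inputPath; }
    public void setInputPath(String inputPath) { this.inputPath = inputPath; }

    public String getOutputPath() { return outputPath; }
    public void setOutputPath(String outputPath) { this.outputPath = outputPath; }

    public int getNumClusters() { return numClusters; }
    public void setNumClusters(int numClusters) { this.numClusters = numClusters; }

    /** Validates the configuration and returns a description of what would run. */
    public String run() {
        Objects.requireNonNull(inputPath, "inputPath must be set");
        Objects.requireNonNull(outputPath, "outputPath must be set");
        // A real job would vectorize the input and start the clustering here.
        return "cluster " + inputPath + " into " + numClusters + " clusters -> " + outputPath;
    }
}

/** Hypothetical CLI driver: it only maps String[] arguments onto the bean. */
class ClusteringJobDriver {
    static ClusteringJob parse(String[] args) {
        if (args.length % 2 != 0) {
            throw new IllegalArgumentException("Expected option/value pairs");
        }
        ClusteringJob job = new ClusteringJob();
        for (int i = 0; i < args.length; i += 2) {
            String value = args[i + 1];
            switch (args[i]) {
                case "--input": job.setInputPath(value); break;
                case "--output": job.setOutputPath(value); break;
                case "--clusters": job.setNumClusters(Integer.parseInt(value)); break;
                default: throw new IllegalArgumentException("Unknown option: " + args[i]);
            }
        }
        return job;
    }

    public static void main(String[] args) {
        System.out.println(parse(args).run());
    }
}
```

With this split, Java code can configure the job through typed setters and call 
run() directly, while the command line goes through the driver, which is the 
only place that deals with String arrays.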

Frank

On Apr 1, 2014, at 12:09, Ted Dunning <[email protected]> wrote:

> I would rather see a matrix that looks local but acts global so that coders 
> can produce very simple code that is still parallelized.  
> 
> Sent from my iPhone
> 
>> On Apr 1, 2014, at 11:09, "Anand Avati (JIRA)" <[email protected]> wrote:
>> 
>> 
>>   [ 
>> https://issues.apache.org/jira/browse/MAHOUT-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956283#comment-13956283
>>  ] 
>> 
>> Anand Avati commented on MAHOUT-1500:
>> -------------------------------------
>> 
>> Thanks for your feedback, Dmitry.
>> 
>> Now it seems to me (with my limited exploring of Mahout) that it might 
>> actually be viable to provide a "hadoop alternative" in the form of an 
>> alternate implementation of DistributedRowMatrix (instead of AbstractMatrix) 
>> and AbstractJob (by internally using h2o's Frame/Vec and MRTask2 APIs), and 
>> thereby allow for a runtime choice of Hadoop vs H2O. This seems like a 
>> reasonable first step?
>> 
>>> H2O integration
>>> ---------------
>>> 
>>>               Key: MAHOUT-1500
>>>               URL: https://issues.apache.org/jira/browse/MAHOUT-1500
>>>           Project: Mahout
>>>        Issue Type: Improvement
>>>          Reporter: Anand Avati
>>>           Fix For: 1.0
>>> 
>>> 
>>> Integration with h2o (github.com/0xdata/h2o) in order to exploit its high 
>>> performance computational abilities.
>>> Start with providing implementations of AbstractMatrix and AbstractVector, 
>>> and more as we make progress.
>> 
>> 
>> 
>> --
>> This message was sent by Atlassian JIRA
>> (v6.2#6252)
