Re: [jira] [Commented] (MAHOUT-1500) H2O integration

Pat Ferrel Tue, 01 Apr 2014 06:37:37 -0700

Wow, this doesn’t seem like a good thing for Jira. It looks like a fishing 
expedition. It’s far too broad to be actually implemented. If Ted or someone 
has a concrete idea, why not put that in a Jira?

Actually this brings up a problem with the current forums for coordination. 
There must be a better way than email or vague tickets. There are already many 
Spark Jiras that touch on each other and are hard to follow as a group. Anyone 
wishing to contribute to the effort will have a hard time tracking all the 
discussions.

Why not a dev wiki where anyone can contribute, maybe open up the mahout github 
wiki. Ideally it ends up pointing to specific Jiras and is a collection point 
where the leaders like Dmitriy or Sebastian or Ted can keep contributions on 
track. Not sure if the same could be done on the Apache wiki but it’s more 
about docs anyway. There are also ways to make ticket dependencies but those 
are imo very hard to track. 

I’ve always worried that two separate engine integration efforts were going to 
be a mess, if not a train wreck. It’s going to confuse a lot of potential 
contributors. Following disjointed email threads and vague or scattered Jira 
tickets isn’t going to help.

On Apr 1, 2014, at 4:16 AM, Frank Scholten <[email protected]> wrote:

I suggest creating a separate mahout-h2o module that shows a simple end-to-end 
example, including vectorizing. I had a look at the h2o API, it looks 
interesting, and I am curious to see how to vectorize data from different 
sources. We could start by taking an existing example like clustering Reuters 
for instance.

I would not suggest immediately trying to extend from existing Mahout APIs. I 
agree with Dmitriy that we shouldn't mix distributed and local code. After 
creating a few examples we can see where code can be reused and where the 
boundaries are. It also gives everyone a feel for the h2o API. Then we can 
extract common code.

I also like Anand's idea of creating an h2o alternative of a Hadoop job. I 
would like to see this implemented as a Java bean with a separate CLI driver, 
so the class is easy to use from Java. Current Mahout jobs have to be called 
via main methods with String arrays. See lucene2seq as an example of the bean 
config idea.
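
A minimal sketch of what the bean-plus-driver split could look like. Every 
name here (ExampleClusteringJob, ExampleClusteringCli, the --input, --output 
and --clusters flags) is hypothetical and exists neither in Mahout nor in h2o; 
run() only validates the settings and returns a summary string instead of 
doing real work.

```java
// Hypothetical sketch: none of these class names or flags exist in Mahout or h2o.

/** Job configured as a plain Java bean, so Java callers never build String arrays. */
class ExampleClusteringJob {
    private String input;
    private String output;
    private int numClusters = 10; // arbitrary default chosen for this sketch

    public String getInput() { return input; }
    public void setInput(String input) { this.input = input; }
    public String getOutput() { return output; }
    public void setOutput(String output) { this.output = output; }
    public int getNumClusters() { return numClusters; }
    public void setNumClusters(int numClusters) { this.numClusters = numClusters; }

    /** Validates the configuration; a real job would launch the computation here. */
    public String run() {
        if (input == null || output == null) {
            throw new IllegalStateException("input and output must be set");
        }
        if (numClusters < 1) {
            throw new IllegalStateException("numClusters must be positive");
        }
        return "clustering " + input + " into " + numClusters + " clusters -> " + output;
    }
}

/** Thin CLI driver: its only responsibility is mapping flags onto the bean. */
class ExampleClusteringCli {
    static ExampleClusteringJob parse(String[] args) {
        ExampleClusteringJob job = new ExampleClusteringJob();
        // Arguments are expected as flag/value pairs.
        for (int i = 0; i < args.length; i += 2) {
            if (i + 1 >= args.length) {
                throw new IllegalArgumentException("missing value for " + args[i]);
            }
            String value = args[i + 1];
            switch (args[i]) {
                case "--input": job.setInput(value); break;
                case "--output": job.setOutput(value); break;
                case "--clusters": job.setNumClusters(Integer.parseInt(value)); break;
                default: throw new IllegalArgumentException("unknown flag " + args[i]);
            }
        }
        return job;
    }

    public static void main(String[] args) {
        System.out.println(parse(args).run());
    }
}
```

From Java code the driver is skipped entirely: create the bean, call the 
setters, then call run(). Only the command line goes through String arrays.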

Frank

On Apr 1, 2014, at 12:09, Ted Dunning <[email protected]> wrote:

> I would rather see a matrix that looks local but acts global so that coders 
> can produce very simple code that is still parallelized.  
> 
> Sent from my iPhone
> 
>> On Apr 1, 2014, at 11:09, "Anand Avati (JIRA)" <[email protected]> wrote:
>> 
>> 
>>  [ 
>> https://issues.apache.org/jira/browse/MAHOUT-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956283#comment-13956283
>>  ] 
>> 
>> Anand Avati commented on MAHOUT-1500:
>> -------------------------------------
>> 
>> Thanks for your feedback, Dmitriy.
>> 
>> Now it seems to me (with my limited exploring of Mahout) that it might 
>> actually be viable to provide a "hadoop alternative" in the form of an 
>> alternate implementation of DistributedRowMatrix (instead of AbstractMatrix) 
>> and AbstractJob (by internally using h2o's Frame/Vec and MRTask2 APIs), and 
>> thereby allow for a runtime choice of Hadoop vs H2O. This seems like a 
>> reasonable first step?
>> 
>>> H2O integration
>>> ---------------
>>> 
>>>              Key: MAHOUT-1500
>>>              URL: https://issues.apache.org/jira/browse/MAHOUT-1500
>>>          Project: Mahout
>>>       Issue Type: Improvement
>>>         Reporter: Anand Avati
>>>          Fix For: 1.0
>>> 
>>> 
>>> Integration with h2o (github.com/0xdata/h2o) in order to exploit its high 
>>> performance computational abilities.
>>> Start with providing implementations of AbstractMatrix and AbstractVector, 
>>> and more as we make progress.
>> 
>> 
>> 
>> --
>> This message was sent by Atlassian JIRA
>> (v6.2#6252)
