I also agree that it is crucial to provide proper abstractions for the distributed operations of an h2o-backed matrix.
I suggest we wait for some early-stage example code and base our arguments on that, so that the discussion does not lose focus.

-sebastian

On 02.04.2014 at 07:51, "Dmitriy Lyubimov" <[email protected]> wrote:

> Previous projects used Google docs.
>
> I personally accept any form of technical documentation for peer
> review. Have seen none so far.
>
> I also would like to ask participants to abstain from marketing and bio
> messages in the middle of a technical discussion thread; it is hard to find
> time to read through it as it is. The Apache etiquette is that it is generally
> ok to make introductions of products however market-y, bios, services or
> persons on a dedicated thread with appropriate subject line. On technical
> threads I would really appreciate if participants keep on the issue, make
> very specific arguments and abstain from argumentation fallacies. [1]
>
> [1] http://en.wikipedia.org/wiki/List_of_fallacies
>
>
> On Tue, Apr 1, 2014 at 6:37 AM, Pat Ferrel <[email protected]> wrote:
>
> > Wow, this doesn't seem like a good thing for Jira. It looks like a fishing
> > expedition. It's far too broad to be actually implemented. If Ted or
> > someone has a concrete idea, why not put that in a Jira.
> >
> > Actually this brings up a problem with the current forums for
> > coordination. There must be a better way than email or vague tickets. There
> > are already many Spark Jiras that touch on each other and are hard to
> > follow as a group. Anyone wishing to contribute to the effort will have a
> > hard time tracking all the discussions.
> >
> > Why not a dev wiki where anyone can contribute, maybe open up the mahout
> > github wiki. Ideally it ends up pointing to specific Jiras and is a
> > collection point where the leaders like Dmitriy or Sebastian or Ted can
> > keep contributions on track. Not sure if the same could be done on the
> > Apache wiki but it's more about docs anyway.
> > There are also ways to make
> > ticket dependencies but those are imo very hard to track.
> >
> > I've always worried that two separate engine integration efforts were going
> > to be a mess, if not a train wreck. It's going to confuse a lot of
> > potential contributors. Following disjointed email threads and vague or
> > scattered Jira tickets isn't going to help.
> >
> > On Apr 1, 2014, at 4:16 AM, Frank Scholten <[email protected]> wrote:
> >
> > I suggest to create a separate mahout-h2o module that shows a simple
> > end-to-end example, including vectorizing. I had a look at the h2o API; it
> > looks interesting, and I am curious to see how to vectorize data from
> > different sources. We could start by taking an existing example like
> > clustering Reuters for instance.
> >
> > I would not suggest to immediately try to extend from existing Mahout
> > APIs. I agree with Dmitriy that we shouldn't mix distributed and local
> > code. After creating a few examples we can see where code can be reused and
> > where the boundaries are. It also gives everyone a feel for the h2o API.
> > Then we can extract common code.
> >
> > I also like Anand's idea of creating an h2o alternative of a Hadoop job. I
> > would like to see this implemented as a Java bean with a separate CLI
> > driver class so it is easy to use in Java. Current Mahout jobs have to be
> > called via main methods with String arrays. See the lucene2seq as an
> > example of the bean config idea.
> >
> > Frank
> >
> > On Apr 1, 2014, at 12:09, Ted Dunning <[email protected]> wrote:
> >
> > > I would rather see a matrix that looks local but acts global so that
> > > coders can produce very simple code that is still parallelized.
> > >
> > > Sent from my iPhone
> > >
> > >> On Apr 1, 2014, at 11:09, "Anand Avati (JIRA)" <[email protected]> wrote:
> > >>
> > >>
> > >> [ https://issues.apache.org/jira/browse/MAHOUT-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956283#comment-13956283 ]
> > >>
> > >> Anand Avati commented on MAHOUT-1500:
> > >> -------------------------------------
> > >>
> > >> Thanks for your feedback, Dmitriy.
> > >>
> > >> Now it seems to me (with my limited exploring of Mahout) that it might
> > >> actually be viable to provide a "hadoop alternative" in the form of an
> > >> alternate implementation of DistributedRowMatrix (instead of
> > >> AbstractMatrix) and AbstractJob (by internally using h2o's Frame/Vec and
> > >> MRTask2 APIs), and thereby allow for a runtime choice of Hadoop vs H2O.
> > >> This seems like a reasonable first step?
> > >>
> > >>> H2O integration
> > >>> ---------------
> > >>>
> > >>> Key: MAHOUT-1500
> > >>> URL: https://issues.apache.org/jira/browse/MAHOUT-1500
> > >>> Project: Mahout
> > >>> Issue Type: Improvement
> > >>> Reporter: Anand Avati
> > >>> Fix For: 1.0
> > >>>
> > >>>
> > >>> Integration with h2o (github.com/0xdata/h2o) in order to exploit its
> > >>> high performance computational abilities.
> > >>> Start with providing implementations of AbstractMatrix and
> > >>> AbstractVector, and more as we make progress.
> > >>
> > >>
> > >>
> > >> --
> > >> This message was sent by Atlassian JIRA
> > >> (v6.2#6252)
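
For illustration only: Ted's suggestion in the quoted thread, a matrix that looks local but acts global, might be sketched roughly as below. This is a minimal sketch under stated assumptions, not Mahout or h2o code. The names SimpleMatrix, ParallelRowMatrix and LocalLookingMatrixDemo are hypothetical, and a parallel stream over local threads merely stands in for a cluster backend.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical interface (not Mahout's or h2o's API): the caller sees plain
// matrix operations and never handles partitions or workers.
interface SimpleMatrix {
    int numRows();
    int numCols();
    double get(int row, int col);
    double[] times(double[] vector);
}

// Assumed stand-in for a distributed backend: rows are processed in parallel
// on local threads. A real backend would ship the per-row work to a cluster.
// The input is assumed to be rectangular (all rows of equal length).
final class ParallelRowMatrix implements SimpleMatrix {
    private final double[][] rows;

    ParallelRowMatrix(double[][] rows) {
        // Copy the input so later changes by the caller do not leak in.
        this.rows = Arrays.stream(rows).map(double[]::clone).toArray(double[][]::new);
    }

    @Override public int numRows() { return rows.length; }
    @Override public int numCols() { return rows.length == 0 ? 0 : rows[0].length; }
    @Override public double get(int row, int col) { return rows[row][col]; }

    @Override
    public double[] times(double[] vector) {
        if (vector.length != numCols()) {
            throw new IllegalArgumentException("vector length must equal numCols()");
        }
        // One dot product per row, evaluated in parallel; toArray() keeps row order.
        return IntStream.range(0, rows.length)
                .parallel()
                .mapToDouble(i -> dot(rows[i], vector))
                .toArray();
    }

    private static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int k = 0; k < a.length; k++) {
            sum += a[k] * b[k];
        }
        return sum;
    }
}

class LocalLookingMatrixDemo {
    public static void main(String[] args) {
        SimpleMatrix m = new ParallelRowMatrix(new double[][] {{1, 2}, {3, 4}});
        // Reads like local code; the parallelism is hidden behind times().
        System.out.println(Arrays.toString(m.times(new double[] {1, 1})));
    }
}
```

The point of the sketch is only the shape of the API: calling code depends on SimpleMatrix, so a different backend could be swapped in behind the same interface without touching the caller.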
