Re: [jira] [Commented] (MAHOUT-1500) H2O integration

Sebastian Schelter Tue, 01 Apr 2014 23:40:29 -0700

I also agree that it is crucial to provide proper abstractions for the
distributed operations of an h2o backed matrix.


I suggest we wait for some early-stage example code and base our arguments
on that, so we do not lose focus in the discussion.

-sebastian
On 02.04.2014 07:51, "Dmitriy Lyubimov" <[email protected]> wrote:

> Previous projects used Google docs.
>
> I personally accept any form of technical documentation for peer
> review. Have seen none so far.
>
> I would also like to ask participants to abstain from marketing and bio
> messages in the middle of a technical discussion thread; it is hard
> enough to find time to read through it as it is. The Apache etiquette is
> that it is generally ok to introduce products (however market-y), bios,
> services or persons on a dedicated thread with an appropriate subject
> line. On technical threads I would really appreciate it if participants
> stay on the issue, make very specific arguments and abstain from
> argumentation fallacies. [1]
>
> [1] http://en.wikipedia.org/wiki/List_of_fallacies
>
>
> On Tue, Apr 1, 2014 at 6:37 AM, Pat Ferrel <[email protected]> wrote:
>
> > Wow, this doesn't seem like a good thing for Jira. It looks like a
> > fishing expedition. It's far too broad to be actually implemented. If
> > Ted or someone has a concrete idea, why not put that in a Jira.
> >
> > Actually this brings up a problem with the current forums for
> > coordination. There must be a better way than email or vague tickets.
> > There are already many Spark Jiras that touch on each other and are hard
> > to follow as a group. Anyone wishing to contribute to the effort will
> > have a hard time tracking all the discussions.
> >
> > Why not a dev wiki where anyone can contribute? Maybe open up the Mahout
> > GitHub wiki. Ideally it ends up pointing to specific Jiras and is a
> > collection point where the leaders like Dmitriy or Sebastian or Ted can
> > keep contributions on track. Not sure if the same could be done on the
> > Apache wiki but it's more about docs anyway. There are also ways to make
> > ticket dependencies but those are imo very hard to track.
> >
> > I've always worried that two separate engine integration efforts were
> > going to be a mess, if not a train wreck. It's going to confuse a lot of
> > potential contributors. Following disjointed email threads and vague or
> > scattered Jira tickets isn't going to help.
> >
> > On Apr 1, 2014, at 4:16 AM, Frank Scholten <[email protected]> wrote:
> >
> > I suggest creating a separate mahout-h2o module that shows a simple
> > end-to-end example, including vectorizing. I had a look at the h2o API;
> > it looks interesting, and I am curious to see how to vectorize data from
> > different sources. We could start by taking an existing example like
> > clustering Reuters for instance.
> >
> > I would not suggest immediately trying to extend existing Mahout APIs.
> > I agree with Dmitriy that we shouldn't mix distributed and local code.
> > After creating a few examples we can see where code can be reused and
> > where the boundaries are. It also gives everyone a feel for the h2o API.
> > Then we can extract common code.
> >
> > I also like Anand's idea of creating an h2o alternative of a Hadoop job.
> > I would like to see this implemented as a Java bean with a separate CLI
> > driver, so that the class is easy to use from Java. Current Mahout jobs
> > have to be called via main methods with String arrays. See lucene2seq as
> > an example of the bean config idea.
> >
> > Frank
> >
> > On Apr 1, 2014, at 12:09, Ted Dunning <[email protected]> wrote:
> >
> > > I would rather see a matrix that looks local but acts global so that
> > > coders can produce very simple code that is still parallelized.
> > >
> > > Sent from my iPhone
> > >
> > >> On Apr 1, 2014, at 11:09, "Anand Avati (JIRA)" <[email protected]> wrote:
> > >>
> > >>
> > >>  [ https://issues.apache.org/jira/browse/MAHOUT-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956283#comment-13956283 ]
> > >>
> > >> Anand Avati commented on MAHOUT-1500:
> > >> -------------------------------------
> > >>
> > >> Thanks for your feedback, Dmitriy.
> > >>
> > >> Now it seems to me (with my limited exploring of Mahout) that it might
> > >> actually be viable to provide a "hadoop alternative" in the form of an
> > >> alternate implementation of DistributedRowMatrix (instead of
> > >> AbstractMatrix) and AbstractJob (by internally using h2o's Frame/Vec and
> > >> MRTask2 APIs), and thereby allow for a runtime choice of Hadoop vs H2O.
> > >> This seems like a reasonable first step?
> > >>
> > >>> H2O integration
> > >>> ---------------
> > >>>
> > >>>              Key: MAHOUT-1500
> > >>>              URL: https://issues.apache.org/jira/browse/MAHOUT-1500
> > >>>          Project: Mahout
> > >>>       Issue Type: Improvement
> > >>>         Reporter: Anand Avati
> > >>>          Fix For: 1.0
> > >>>
> > >>>
> > >>> Integration with h2o (github.com/0xdata/h2o) in order to exploit its
> > >>> high performance computational abilities.
> > >>> Start with providing implementations of AbstractMatrix and
> > >>> AbstractVector, and more as we make progress.
> > >>
> > >>
> > >>
> > >> --
> > >> This message was sent by Atlassian JIRA
> > >> (v6.2#6252)
> >
> >
>
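
Frank's suggestion quoted above, a job implemented as a Java bean with a
separate CLI driver, could look roughly like the sketch below. Every name in
it (ClusteringJob, ClusteringJobDriver, the --input/--output/--clusters
options) is hypothetical and is not part of the Mahout or h2o APIs, and run()
only returns a description string where a real job would do the work.

```java
/** Hypothetical bean-style job: configured through setters, no String[] needed. */
class ClusteringJob {
    private String inputPath;
    private String outputPath;
    private int numClusters = 10; // assumed default, for illustration only

    public String getInputPath() { return inputPath; }
    public void setInputPath(String inputPath) { this.inputPath = inputPath; }

    public String getOutputPath() { return outputPath; }
    public void setOutputPath(String outputPath) { this.outputPath = outputPath; }

    public int getNumClusters() { return numClusters; }
    public void setNumClusters(int numClusters) { this.numClusters = numClusters; }

    /** Placeholder for the real computation; validates the bean, then describes the run. */
    public String run() {
        if (inputPath == null || outputPath == null) {
            throw new IllegalStateException("inputPath and outputPath must be set");
        }
        return "clustering " + inputPath + " into " + numClusters
                + " clusters -> " + outputPath;
    }
}

/** Thin CLI driver: its only job is to map String[] options onto the bean. */
class ClusteringJobDriver {
    static ClusteringJob fromArgs(String[] args) {
        if (args.length % 2 != 0) {
            throw new IllegalArgumentException("Options must come as '--name value' pairs");
        }
        ClusteringJob job = new ClusteringJob();
        for (int i = 0; i < args.length; i += 2) {
            String value = args[i + 1];
            switch (args[i]) {
                case "--input":    job.setInputPath(value); break;
                case "--output":   job.setOutputPath(value); break;
                case "--clusters": job.setNumClusters(Integer.parseInt(value)); break;
                default:
                    throw new IllegalArgumentException("Unknown option: " + args[i]);
            }
        }
        return job;
    }

    public static void main(String[] args) {
        System.out.println(fromArgs(args).run());
    }
}
```

With this split, Java callers configure ClusteringJob directly through its
setters, while the command line goes through ClusteringJobDriver, so the
String array parsing stays out of the job class itself.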
