Re: [jira] [Commented] (MAHOUT-1500) H2O integration

Dmitriy Lyubimov Tue, 01 Apr 2014 22:51:26 -0700

Previous projects used Google docs.

I personally would accept any form of technical documentation for peer
review. I have seen none so far.


I would also like to ask participants to abstain from marketing and bio
messages in the middle of a technical discussion thread; it is hard to find
time to read through it as it is. The Apache etiquette is that it is generally
ok to make introductions of products (however market-y), bios, services or
persons on a dedicated thread with an appropriate subject line. On technical
threads I would really appreciate it if participants stay on the issue, make
very specific arguments and abstain from argumentation fallacies. [1]

[1] http://en.wikipedia.org/wiki/List_of_fallacies


On Tue, Apr 1, 2014 at 6:37 AM, Pat Ferrel <[email protected]> wrote:

> Wow, this doesn't seem like a good thing for Jira. It looks like a fishing
> expedition. It's far too broad to be actually implemented. If Ted or
> someone has a concrete idea, why not put that in a Jira?
>
> Actually this brings up a problem with the current forums for
> coordination. There must be a better way than email or vague tickets. There
> are already many Spark Jiras that touch on each other and are hard to
> follow as a group. Anyone wishing to contribute to the effort will have a
> hard time tracking all the discussions.
>
> Why not a dev wiki where anyone can contribute, maybe open up the mahout
> github wiki. Ideally it ends up pointing to specific Jiras and is a
> collection point where the leaders like Dmitriy or Sebastian or Ted can
> keep contributions on track. Not sure if the same could be done on the
> Apache wiki but it's more about docs anyway. There are also ways to make
> ticket dependencies but those are imo very hard to track.
>
> I've always worried that two separate engine integration efforts were going
> to be a mess, if not a train wreck. It's going to confuse a lot of
> potential contributors. Following disjointed email threads and vague or
> scattered Jira tickets isn't going to help.
>
> On Apr 1, 2014, at 4:16 AM, Frank Scholten <[email protected]> wrote:
>
> I suggest creating a separate mahout-h2o module that shows a simple
> end-to-end example, including vectorizing. I had a look at the h2o API, it
> looks interesting, and I am curious to see how to vectorize data from
> different sources. We could start by taking an existing example like
> clustering Reuters for instance.
>
> I would not suggest immediately trying to extend from existing Mahout
> APIs. I agree with Dmitriy that we shouldn't mix distributed and local
> code. After creating a few examples we can see where code can be reused and
> where the boundaries are. It also gives everyone a feel for the h2o API.
> Then we can extract common code.
>
> I also like Anand's idea of creating an h2o alternative to a Hadoop job. I
> would like to see this implemented as a Java bean with a separate CLI
> driver class so it is easy to use in Java. Current Mahout jobs have to be
> called via main methods with String arrays. See lucene2seq as an
> example of the bean config idea.
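
A minimal sketch of the bean-plus-driver idea described above, assuming
hypothetical names throughout: ExampleJob, ExampleJobDriver and the option
names --input, --output and --numClusters are illustrations only and are not
part of the Mahout or h2o APIs. The job is a plain Java bean configured through
setters, and a separate driver class is the only place that handles String
arrays.

```java
import java.util.Objects;

// Hypothetical job bean: configured through setters, callable directly from Java.
class ExampleJob {
    private String input;
    private String output;
    private int numClusters = 10;

    public String getInput() { return input; }
    public void setInput(String input) { this.input = input; }

    public String getOutput() { return output; }
    public void setOutput(String output) { this.output = output; }

    public int getNumClusters() { return numClusters; }
    public void setNumClusters(int numClusters) { this.numClusters = numClusters; }

    // Placeholder for the real work; here it only validates and describes the run.
    public String run() {
        Objects.requireNonNull(input, "input must be set");
        Objects.requireNonNull(output, "output must be set");
        return "clustering " + input + " -> " + output + " (k=" + numClusters + ")";
    }
}

// Hypothetical CLI driver: translates "--name value" pairs into bean properties.
class ExampleJobDriver {
    static ExampleJob parse(String[] args) {
        if (args.length % 2 != 0) {
            throw new IllegalArgumentException("Options must be given as name/value pairs");
        }
        ExampleJob job = new ExampleJob();
        for (int i = 0; i < args.length; i += 2) {
            String value = args[i + 1];
            switch (args[i]) {
                case "--input":
                    job.setInput(value);
                    break;
                case "--output":
                    job.setOutput(value);
                    break;
                case "--numClusters":
                    job.setNumClusters(Integer.parseInt(value));
                    break;
                default:
                    throw new IllegalArgumentException("Unknown option: " + args[i]);
            }
        }
        return job;
    }

    public static void main(String[] args) {
        System.out.println(parse(args).run());
    }
}
```

With this split, Java callers construct and configure ExampleJob directly,
while command-line users go through ExampleJobDriver.main.
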
>
> Frank
>
> On Apr 1, 2014, at 12:09, Ted Dunning <[email protected]> wrote:
>
> > I would rather see a matrix that looks local but acts global so that
> > coders can produce very simple code that is still parallelized.
> >
> > Sent from my iPhone
> >
> >> On Apr 1, 2014, at 11:09, "Anand Avati (JIRA)" <[email protected]> wrote:
> >>
> >>
> >>  [https://issues.apache.org/jira/browse/MAHOUT-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956283#comment-13956283]
> >>
> >> Anand Avati commented on MAHOUT-1500:
> >> -------------------------------------
> >>
> >> Thanks for your feedback, Dmitriy.
> >>
> >> Now it seems to me (with my limited exploring of Mahout) that it might
> >> actually be viable to provide a "hadoop alternative" in the form of an
> >> alternate implementation of DistributedRowMatrix (instead of
> >> AbstractMatrix) and AbstractJob (by internally using h2o's Frame/Vec and
> >> MRTask2 APIs), and thereby allow for a runtime choice of Hadoop vs H2O.
> >> This seems like a reasonable first step?
> >>
> >>> H2O integration
> >>> ---------------
> >>>
> >>>              Key: MAHOUT-1500
> >>>              URL: https://issues.apache.org/jira/browse/MAHOUT-1500
> >>>          Project: Mahout
> >>>       Issue Type: Improvement
> >>>         Reporter: Anand Avati
> >>>          Fix For: 1.0
> >>>
> >>>
> >>> Integration with h2o (github.com/0xdata/h2o) in order to exploit its
> >>> high performance computational abilities.
> >>> Start with providing implementations of AbstractMatrix and
> >>> AbstractVector, and more as we make progress.
> >>
> >>
> >>
> >> --
> >> This message was sent by Atlassian JIRA
> >> (v6.2#6252)
>
>
