Wow, this doesn’t seem like a good thing for Jira. It looks like a fishing expedition. It’s far too broad to actually be implemented. If Ted or someone has a concrete idea, why not put that in a Jira?
Actually, this brings up a problem with the current forums for coordination. There must be a better way than email or vague tickets. There are already many Spark Jiras that touch on each other and are hard to follow as a group. Anyone wishing to contribute to the effort will have a hard time tracking all the discussions.

Why not a dev wiki where anyone can contribute, maybe by opening up the Mahout GitHub wiki? Ideally it ends up pointing to specific Jiras and is a collection point where leaders like Dmitriy, Sebastian, or Ted can keep contributions on track. Not sure if the same could be done on the Apache wiki, but it’s more about docs anyway. There are also ways to make ticket dependencies, but those are IMO very hard to track.

I’ve always worried that two separate engine integration efforts were going to be a mess, if not a train wreck. It’s going to confuse a lot of potential contributors. Following disjointed email threads and vague or scattered Jira tickets isn’t going to help.

On Apr 1, 2014, at 4:16 AM, Frank Scholten <[email protected]> wrote:

I suggest creating a separate mahout-h2o module that shows a simple end-to-end example, including vectorizing. I had a look at the h2o API, it looks interesting, and I am curious to see how to vectorize data from different sources. We could start by taking an existing example, like clustering Reuters for instance.

I would not suggest immediately trying to extend from existing Mahout APIs. I agree with Dmitriy that we shouldn't mix distributed and local code. After creating a few examples, we can see where code can be reused and where the boundaries are. It also gives everyone a feel for the h2o API. Then we can extract common code.

I also like Anand's idea of creating an h2o alternative of a Hadoop job. I would like to see this implemented as a Java bean with a separate CLI driver class so it is easy to use in Java. Current Mahout jobs have to be called via main methods with String arrays.
See the lucene2seq as an example of the bean config idea.

Frank

On Apr 1, 2014, at 12:09, Ted Dunning <[email protected]> wrote:

> I would rather see a matrix that looks local but acts global so that coders
> can produce very simple code that is still parallelized.
>
> Sent from my iPhone
>
>> On Apr 1, 2014, at 11:09, "Anand Avati (JIRA)" <[email protected]> wrote:
>>
>>
>> [
>> https://issues.apache.org/jira/browse/MAHOUT-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13956283#comment-13956283
>> ]
>>
>> Anand Avati commented on MAHOUT-1500:
>> -------------------------------------
>>
>> Thanks for your feedback, Dmitry.
>>
>> Now it seems to me (with my limited exploring of Mahout) that it might
>> actually be viable to provide a "hadoop alternative" in the form of an
>> alternate implementation of DistributedRowMatrix (instead of AbstractMatrix)
>> and AbstractJob (by internally using h2o's Frame/Vec and MRTask2 APIs), and
>> thereby allow for a runtime choice of Hadoop vs H2O. This seems like a
>> reasonable first step?
>>
>>> H2O integration
>>> ---------------
>>>
>>> Key: MAHOUT-1500
>>> URL: https://issues.apache.org/jira/browse/MAHOUT-1500
>>> Project: Mahout
>>> Issue Type: Improvement
>>> Reporter: Anand Avati
>>> Fix For: 1.0
>>>
>>>
>>> Integration with h2o (github.com/0xdata/h2o) in order to exploit its high
>>> performance computational abilities.
>>> Start with providing implementations of AbstractMatrix and AbstractVector,
>>> and more as we make progress.
>>
>>
>>
>> --
>> This message was sent by Atlassian JIRA
>> (v6.2#6252)
