Why aren’t we using linalg.Vector and its siblings? The same could be asked of linalg.Matrix. If we want to prune dependencies, this would help, and it would also significantly increase interoperability.
Case-now: I have a real need to cluster items in a CF-type input matrix. The input matrix A’ has rows of items. I need to drop this into a sequence file and use Mahout’s Hadoop KMeans. Ugh. Or I need to convert A’ into an RDD of linalg.Vectors and use MLlib KMeans. The conversion is not too bad and maybe could be helped with some implicit conversions mahout.Vector <-> linalg.Vector (maybe mahout.DRM <-> linalg.Matrix, though not needed for KMeans); a rough sketch of such a conversion is at the end of this message.

Case-possible: If we adopted linalg.Vector as the native format, and perhaps even linalg.Matrix, this would give immediate interoperability in some areas, including my specific need. It would significantly pare down dependencies not provided by the environment (mahout-math). It would also support creating distributed computation methods that work on both MLlib and Mahout datasets, addressing Gokhan’s question.

I looked at another “Case-now” possibility, which was to go all-MLlib for item similarity. I found that MLlib doesn’t have a transpose (“transpose, why would you want to do that?”), not even in the multiply forms A’A, A’B, and AA’, all of which are used in item and row similarity (these forms are sketched in the DSL below). That stopped me from looking deeper.

The strength and unique value of Mahout is the completeness of its generalized linear algebra DSL. But insistence on using Mahout-specific data types is also a barrier to Spark people adopting the DSL. Not having lower-level interoperability is a barrier both ways to mixing Mahout and MLlib, creating unnecessary either/or choices for devs.

On Feb 5, 2015, at 1:32 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

On Thu, Feb 5, 2015 at 1:14 AM, Gokhan Capan <gkhn...@gmail.com> wrote:

> What I am saying is that for certain algorithms including both
> engine-specific (such as aggregation) and DSL stuff, what is the best way
> of handling them?
>
> i) should we add the distributed operations to the Mahout codebase as is
> proposed in #62?

IMO this can't go very well or very far (because of the engine specifics), but I'd be willing to see an experiment with simple things like map and reduce (see the mapBlock sketch below). The bigger questions are: where exactly will we have to stop (we can't abstract all capabilities out there because of "common denominator" issues), and what percentage of methods will it truly allow to migrate to full backend portability? And if, after doing all this, we still find ourselves writing engine-specific mixes, why bother? Wouldn't it be better to find a good, easy-to-replicate, incrementally developed pattern to register and apply engine-specific strategies for every method (one possible shape is sketched below)?

> ii) should we have [engine]-ml modules (like spark-bindings and
> h2o-bindings) where we can mix the DSL and engine-specific stuff?

This is not quite what I am proposing. Rather, engine-ml modules holding engine-specific _parts_ of algorithms. However, this really needs a POC on a guinea pig (similarly to how we POC'd algebra in the first place with ssvd and spca).
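To make the conversion point concrete, here is a minimal sketch of what the mahout.Vector <-> linalg.Vector implicits could look like. The object name VectorConversions is invented for illustration; the sketch assumes Mahout's org.apache.mahout.math.Vector and Spark's org.apache.spark.mllib.linalg.Vectors factory:

    import org.apache.mahout.math.{DenseVector, Vector => MahoutVector}
    import org.apache.spark.mllib.linalg.{Vectors, Vector => MLlibVector}
    import scala.collection.JavaConverters._

    // Hypothetical helpers; names and placement are illustrative only.
    object VectorConversions {

      // Mahout -> MLlib: copy the non-zeroes into an MLlib sparse vector.
      // MLlib expects indices in ascending order, so sort first.
      implicit def mahoutToMLlib(v: MahoutVector): MLlibVector = {
        val nz = v.nonZeroes().asScala.toSeq.sortBy(_.index())
        Vectors.sparse(v.size(), nz.map(_.index()).toArray, nz.map(_.get()).toArray)
      }

      // MLlib -> Mahout: densify for simplicity; a real version would
      // preserve sparsity with a RandomAccessSparseVector.
      implicit def mllibToMahout(v: MLlibVector): MahoutVector =
        new DenseVector(v.toArray)
    }

With these in scope, an RDD of Mahout rows could feed MLlib KMeans directly, e.g. rows.map(mahoutToMLlib), or implicitly.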
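For reference, the multiply forms mentioned above as they read in the Mahout DSL. A sketch only, assuming DRM handles drmA and drmB loaded elsewhere:

    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._

    // The transpose-multiply forms item and row similarity rely on,
    // written against hypothetical DRM handles drmA and drmB.
    def similarityForms(drmA: DrmLike[Int], drmB: DrmLike[Int]) = {
      val ata = drmA.t %*% drmA // A'A: item co-occurrence
      val atb = drmA.t %*% drmB // A'B: cross-co-occurrence with a second action
      val aat = drmA %*% drmA.t // AA': row similarity
      (ata, atb, aat)
    }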
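On the "experiment with simple things like map and reduce" point, a sketch of what the portable flavor already looks like today via the DSL's mapBlock operator. The scaleRows function is a made-up example, not an existing API:

    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.math.scalabindings.RLikeOps._

    // A backend-portable "map" over a DRM: scale every row of drmA,
    // whatever engine backs it.
    def scaleRows(drmA: DrmLike[Int], factor: Double): DrmLike[Int] =
      drmA.mapBlock(ncol = drmA.ncol) { case (keys, block) =>
        keys -> (block *= factor)
      }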
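And one purely hypothetical shape for the "register and apply engine-specific strategies" idea; every name and type here is invented for illustration:

    import org.apache.mahout.math.drm.DrmLike

    // Hypothetical registry: an algorithm written once against the DSL,
    // with an optional engine-specific override registered per backend.
    trait ClusteringStrategy {
      def cluster(data: DrmLike[Int], k: Int): DrmLike[Int]
    }

    object StrategyRegistry {
      private var strategies = Map.empty[String, ClusteringStrategy]

      def register(engine: String, strategy: ClusteringStrategy): Unit =
        strategies += engine -> strategy

      // Fall back to the portable, DSL-only implementation when no
      // engine-specific strategy is registered.
      def forEngine(engine: String, portable: ClusteringStrategy): ClusteringStrategy =
        strategies.getOrElse(engine, portable)
    }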