Perhaps one of the better ways to go about it is to look at the spark module of Mahout.
The minimum needed is the stuff in sparkbindings.SparkEngine plus CheckpointedDRM support. The idea is simple. When expressions are written, they are translated into logical operators implementing the DrmLike[K] API, which know nothing about any concrete engine. When a checkpoint of an expression is invoked, the optimizer applies rule-based rewrites, which may produce more complex but still logical operators: things of the A'B, AB' or A'A type, or elementwise unary function applications translated from elementwise expressions such as exp(A * A). The strategies that build engine-specific physical lineages for these BLAS-like building primitives are found in the sparkbindings.blas package. Unfortunately, the public version is quite a bit behind my own there...

These rewritten operators are passed to Engine.toPhysical, which creates a "checkpointed" DRM. A checkpointed DRM is an encapsulation of engine-specific physical operators (at this point the logical side knows very little about them). In the case of Spark, the logic walks the resulting TWA and translates it into a Spark RDD lineage, applying some cost estimates along the way. So the main points are to implement Engine.toPhysical and Flink's version of CheckpointedDRM.

A few other concepts are broadcasting of in-core matrices and vectors (making them available in every task), and an explicit cache-pinning level, which right now maps more or less 1:1 onto the cache strategies Spark supports, but that can be adjusted. Algorithms provide cache-level hints as a signal that they intend to go over the same source matrix again and again (welcome to the iterative world...).

Finally, another concept is that the in-core stuff is backed by the scalabindings DSL for mahout-math (i.e. pure in-core algebra), which is largely consistent with the distributed-operations DSL: matrix multiplication looks like A %*% B regardless of whether A and B are in-core or distributed.
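To make the logical-vs-physical split concrete, here is a minimal sketch in Scala. The names loosely mirror the real ones (DrmLike, CheckpointedDrm, Engine.toPhysical), but the trait shapes, fields and the toy rewrite rule are my own simplifications for illustration, not Mahout source:

```scala
// Hypothetical, simplified stand-ins for the real Mahout abstractions.

// Logical plan: engine-agnostic operators over a distributed row matrix.
sealed trait DrmLike[K]
case class DrmSource[K](nrow: Int, ncol: Int) extends DrmLike[K]
case class OpAt[K](a: DrmLike[K]) extends DrmLike[K]                    // A'
case class OpTimes[K](a: DrmLike[K], b: DrmLike[K]) extends DrmLike[K]  // A %*% B
case class OpAtB[K](a: DrmLike[K], b: DrmLike[K]) extends DrmLike[K]    // fused A' %*% B
case class OpAewUnary[K](a: DrmLike[K], f: Double => Double)            // e.g. exp(A * A)
  extends DrmLike[K]

// Physical side: what the engine hands back at checkpoint time.
trait CheckpointedDrm[K] extends DrmLike[K] {
  def cacheHint: String  // maps roughly 1:1 onto the backend's cache strategies
}

// Each backend (Spark, Flink, ...) implements this.
trait DistributedEngine {
  def toPhysical[K](optimizedPlan: DrmLike[K], cacheHint: String): CheckpointedDrm[K]
}

// Toy rule-based rewrite: recognize (A')B and fuse it into the A'B operator,
// standing in for the real optimizer's rule set.
def optimize[K](plan: DrmLike[K]): DrmLike[K] = plan match {
  case OpTimes(OpAt(a), b) => OpAtB(optimize(a), optimize(b))
  case other               => other
}
```

A Flink port would then consist mainly of a FlinkEngine implementing toPhysical (translating the rewritten operators into a Flink dataflow) and a Flink-backed CheckpointedDrm.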
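The unified operator style (A %*% B looking the same everywhere, plus the in-core-only mutating operators discussed below) can be illustrated with a toy type. ToyMatrix here is purely hypothetical and far cruder than mahout-math; it only shows how %*%, in-place += and the := function-assignment operator are ordinary Scala methods:

```scala
// Hypothetical dense row-major matrix, for illustrating the DSL's operator
// style only -- not mahout-math, which has much richer Matrix/Vector types.
final class ToyMatrix(val rows: Int, val cols: Int, val data: Array[Double]) {
  private def at(i: Int, j: Int): Double = data(i * cols + j)

  // Matrix product, spelled %*% as in the DSL.
  def %*%(b: ToyMatrix): ToyMatrix = {
    require(cols == b.rows, "dimension mismatch")
    val out = new Array[Double](rows * b.cols)
    for (i <- 0 until rows; j <- 0 until b.cols; k <- 0 until cols)
      out(i * b.cols + j) += at(i, k) * b.at(k, j)
    new ToyMatrix(rows, b.cols, out)
  }

  // In-place elementwise add: a += b (only meaningful in-core).
  def +=(b: ToyMatrix): this.type = {
    for (i <- data.indices) data(i) += b.data(i)
    this
  }

  // Elementwise function assignment, cf. mxA := exp _ in the real DSL.
  def :=(f: Double => Double): this.type = {
    for (i <- data.indices) data(i) = f(data(i))
    this
  }
}
```

Usage then reads like the DSL: `val c = a %*% b`, `a += b`, `a := math.exp _`. The distributed side exposes the same read-only operators but not the mutating ones.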
The major difference for in-core is that in-place modification operators are enabled, such as += and *=, as well as function assignment like mxA := exp _ (elementwise application of the exponent), whereas distributed references are of course immutable (logically, anyway). I need to update the docs for all forms of the in-core (aka 'scalabindings') DSL, but the basic version of the documentation more or less covers the capabilities.

So... as long as serialization of mahout-math's Vector and Matrix interfaces is supported in the backend, that is the only persistence needed. No other types are either persisted or passed around. Mahout supports implicit Writable conversion for those in Scala (or, more concretely, MatrixWritable and VectorWritable), and I have added native Kryo support for them as well (not in the public version).

Glossary:
DRM = distributed row matrix (row-wise partitioned matrix representation)
TWA = tree-walking automaton

On Tue, Feb 3, 2015 at 5:33 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> I may be able to help.
>
> The official link on mahout talks page points to slide share, which
> mangles slides in a weird way, but if it helps, here's the (hidden) link to
> pptx source of those in case it helps:
>
> http://mahout.apache.org/users/sparkbindings/MahoutScalaAndSparkBindings.pptx
>
> On Mon, Feb 2, 2015 at 1:00 AM, mkaul <k...@tu-berlin.de> wrote:
>
>> Ok makes sense to me! :) So we will find something else for the student to
>> do.
>> Would it then be possible for you to maybe very briefly describe your
>> strategy of how you will make this work? Like slides for a design diagram
>> maybe?
>> It might be useful for us to see how the internals would look like at our
>> end too.
>>
>> Cheers,
>> Manu
>>
>> --
>> View this message in context:
>> http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/Kicking-off-the-Machine-Learning-Library-tp2995p3594.html
>> Sent from the Apache Flink (Incubator) Mailing List archive mailing list
>> archive at Nabble.com.