Perhaps one of the best ways to go about it is to look at the Spark module of
Mahout.

The minimum that is needed is what is in sparkbindings.SparkEngine plus
CheckpointedDRM support.

The idea is simple. When expressions are written, they are translated into
logical operators implementing the DrmLike[K] API, which know nothing about
any concrete engine.
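
To make that concrete, here is a tiny sketch of what such an engine-agnostic
logical operator looks like conceptually (illustrative names only, not the
exact Mahout classes):

    // Illustrative sketch only -- approximates the engine-agnostic logical layer;
    // the real Mahout traits and operator classes differ in detail.
    trait DrmLike[K] {
      def nrow: Long
      def ncol: Int
    }

    // Deferred ("logical") matrix multiplication: it records the operation but
    // knows nothing about Spark, Flink or any other concrete engine.
    case class OpTimes[K](a: DrmLike[K], b: DrmLike[Int]) extends DrmLike[K] {
      def nrow: Long = a.nrow
      def ncol: Int  = b.ncol
    }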

When an expression checkpoint is invoked, the optimizer applies rule-based
optimizations, which may create more complex but still logical operators
(A'B, AB' or A'A type of things, or elementwise unary function applications
translated from elementwise expressions like exp(A*A)). The strategies that
build engine-specific physical lineages for these BLAS-like building
primitives are found in the sparkbindings.blas package. Unfortunately, the
public version there is quite a bit behind my own.
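
As a hypothetical illustration of such a rule-based rewrite (names are mine,
not the actual optimizer code): collapsing A.t %*% B into a single fused A'B
operator that a backend can then implement in one pass:

    // Hypothetical sketch of a rule-based rewrite; not the actual Mahout optimizer.
    sealed trait LogicalOp
    case class Leaf(name: String)                   extends LogicalOp
    case class Transpose(a: LogicalOp)              extends LogicalOp
    case class Times(a: LogicalOp, b: LogicalOp)    extends LogicalOp
    case class FusedAtB(a: LogicalOp, b: LogicalOp) extends LogicalOp  // the "A'B" primitive

    def rewrite(op: LogicalOp): LogicalOp = op match {
      case Times(Transpose(a), b) => FusedAtB(rewrite(a), rewrite(b))  // A.t %*% B ==> A'B
      case Times(a, b)            => Times(rewrite(a), rewrite(b))
      case Transpose(a)           => Transpose(rewrite(a))
      case leaf                   => leaf
    }

    // rewrite(Times(Transpose(Leaf("A")), Leaf("B"))) == FusedAtB(Leaf("A"), Leaf("B"))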

These rewritten operators are passed to Engine.toPhysical, which creates a
"checkpointed" DRM. A checkpointed DRM is an encapsulation of engine-specific
physical operators (at that point the logical part knows very little about
them). In the case of Spark, the logic walks the created TWA and translates it
into a Spark RDD lineage (applying some cost estimates along the way).

So the main points are to implement Engine.toPhysical and Flink's version of
CheckpointedDRM.
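
In other words, a Flink backend needs roughly this shape (a rough sketch with
approximate names; only Engine.toPhysical and CheckpointedDRM themselves come
from the actual design):

    // Rough sketch of the two pieces a new backend has to provide; names approximate.
    trait DrmLike[K]                                  // engine-agnostic logical plan node

    trait CheckpointedDrm[K] extends DrmLike[K] {
      def uncache(): Unit                             // release the pinned physical data
    }

    trait DistributedEngine {
      // Walk the rewritten operator tree and translate it into a backend lineage
      // (an RDD lineage on Spark; presumably a DataSet program on Flink).
      def toPhysical[K](plan: DrmLike[K]): CheckpointedDrm[K]
    }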

A few other concepts are broadcasting of in-core matrices and vectors (making
them available in every task), and an explicit cache pinning level (which
right now maps roughly 1:1 to the cache strategies supported in Spark, but
that can be adjusted). Algorithms provide cache level hints as a sign that
they intend to go over the same source matrix again and again (welcome to the
iterative world...).
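
A toy model of those two hints (not the actual Mahout signatures, just the
shape of the idea):

    // Toy model, not actual Mahout signatures: the two hints an algorithm can give
    // the backend -- broadcast of an in-core value, and an explicit cache pinning level.
    object CacheHint extends Enumeration { val NONE, MEMORY_ONLY, MEMORY_AND_DISK = Value }

    trait BCast[T] { def value: T }                   // dereferenced inside any task

    trait DistributedContext {
      def broadcast[T](v: T): BCast[T]                // ship an in-core vector/matrix everywhere
    }

    trait DrmLike[K] {
      // "I intend to sweep this matrix repeatedly -- keep its physical form around."
      def checkpoint(hint: CacheHint.Value): DrmLike[K]
    }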

Finally, another concept is that in-core stuff is backed by the scalabindings
DSL for mahout-math (i.e. pure in-core algebra), which is largely consistent
with the distributed operations DSL. I.e., matrix multiplication looks like
A %*% B regardless of whether A and B are in-core or distributed. The major
difference for in-core is that in-place modification operators are enabled,
such as += and *=, or function assignment like mxA := exp _ (elementwise
taking of an exponent), whereas distributed references are of course immutable
(logically, anyway). I need to update the docs for all forms of the in-core
(aka 'scalabindings') DSL, but the basic version of the documentation covers
the capabilities more or less.
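
For illustration (the import paths below are from memory and may differ
slightly), the in-core side looks like this:

    // In-core ('scalabindings') illustration; import paths from memory, may differ.
    import org.apache.mahout.math.scalabindings._
    import RLikeOps._
    import scala.math.exp

    val mxA = dense((1.0, 2.0), (3.0, 4.0))   // in-core dense matrix
    val mxB = dense((1.0, 0.0), (0.0, 1.0))

    val mxC = mxA %*% mxB                     // multiplication, same syntax as distributed
    mxA += mxB                                // in-place update: in-core only
    mxA := exp _                              // elementwise exponent, also in place

    // The distributed side is syntactically the same but logically immutable:
    //   val drmC = drmA %*% drmB             // just builds a logical plan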

So... as long as serialization of mahout-math's Vector and Matrix interfaces
is supported in the backend, that's the only persistence that is needed. No
other types are persisted or passed around.

Mahout supports implicit Writable conversions for those in Scala (more
concretely, MatrixWritable and VectorWritable), and I have added native Kryo
support for them as well (not in the public version yet).
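
Here is a hedged sketch of how a backend could wire that up by delegating to
VectorWritable; it is only an illustration of the idea, not the non-public
Kryo serializers mentioned above:

    // Sketch: serialize a mahout-math Vector with Kryo by delegating to VectorWritable.
    // Illustration only; not the (non-public) serializers referred to above.
    import com.esotericsoftware.kryo.{Kryo, Serializer}
    import com.esotericsoftware.kryo.io.{Input, Output}
    import org.apache.mahout.math.{Vector, VectorWritable}
    import java.io.{DataInputStream, DataOutputStream}

    class VectorKryoSerializer extends Serializer[Vector] {
      override def write(kryo: Kryo, out: Output, v: Vector): Unit = {
        val dos = new DataOutputStream(out)   // Kryo's Output is an OutputStream
        new VectorWritable(v).write(dos)
        dos.flush()
      }

      override def read(kryo: Kryo, in: Input, cls: Class[Vector]): Vector = {
        val w = new VectorWritable()
        w.readFields(new DataInputStream(in)) // Kryo's Input is an InputStream
        w.get()
      }
    }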


Glossary: DRM = distributed row matrix (row-wise partitioned matrix
representation).
TWA = tree-walking automaton.

On Tue, Feb 3, 2015 at 5:33 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> I may be able to help.
>
> The official link on mahout talks page points to slide share, which
> mangles slides in a weird way, but if it helps, here's the (hidden) link to
> pptx source of those in case it helps:
>
>
> http://mahout.apache.org/users/sparkbindings/MahoutScalaAndSparkBindings.pptx
>
> On Mon, Feb 2, 2015 at 1:00 AM, mkaul <k...@tu-berlin.de> wrote:
>
>> Ok makes sense to me! :) So we will find something else for the student to
>> do.
>> Would it then be possible for you to maybe very briefly describe your
>> strategy of how
>> you will make this work? Like slides for a design diagram maybe?
>> It might be useful for us to see how the internals would look like at our
>> end too.
>>
>> Cheers,
>> Manu
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-flink-incubator-mailing-list-archive.1008284.n3.nabble.com/Kicking-off-the-Machine-Learning-Library-tp2995p3594.html
>> Sent from the Apache Flink (Incubator) Mailing List archive. mailing list
>> archive at Nabble.com.
>>
>
>
