Posting at Pat's request for folks who do not follow the JIRA developments. I probably will need to add all this to the site, although I thought it would be preliminary until it is actually part of an official release.
Anyway, here it is. The relevant JIRA issues are M-1346 and M-1365.

> Also I don't understand the status of your DRM work. Can you do multiply,
> transpose etc. with the code all running on Spark?

Yes, and more. Everything in the PDF is committed and passes functional tests. That includes everything you mentioned, plus SSVD and thin QR. Scalabindings were in the 0.9 release. Sparkbindings is a post-0.9 commit, but everything in the PDF is in the trunk and passes functional unit tests.

Near future: I still haven't committed PCA. There is also an implicit feedback solver that I am allowed to contribute, but I will need to reshape it a little before I do (due to changes in the public version of the sparkbindings).

> How solid is it?

Everything in the PDF passes functional tests. I did not run the SSVD code there at scale, but I had other solvers in the company that did run on an actual cluster. If you can test SSVD at scale, that would be awesome. I am currently on parental leave until 4/7, so I don't have resources to do that. Frankly, I will not have resources to run scale tests on the company servers in the near future either.

> You say it's not production worthy.

I did not say that. Actually, quite the opposite. I said I had some solvers done in this environment internally at my company, and they had no known technical or scale issues, but we did not deploy them because of product priorities. (As it happens, the solver itself is only 10% of the work that needs to happen for the whole product to see production; the remaining 90% requires significant engineering resources, which are not being allocated to this particular feature.)

> How far from it?

I think I answered that.

> Talking about data frame operators does not help those of us who have not
> used Spark yet.

OK, data frames are just a thought at this point. This is all modeled in R's image, not Spark's. And this is only the linear algebra part of it; base R is also + stats and + data frames.
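To make the "modeled in R's image" point concrete, here is a toy, self-contained sketch of what R-like matrix expressions look like in spirit: an in-memory matrix class with `t` and `%*%` operators. The `Drm` class and everything in it is illustrative only, not Mahout's actual DRM code, which runs these operators as distributed plans.

```scala
// Toy illustration of an R-like matrix DSL (NOT Mahout's actual API):
// a small in-memory matrix with transpose and multiply operators.
case class Drm(rows: Array[Array[Double]]) {
  // transpose, R's t(A)
  def t: Drm = Drm(rows.head.indices.map(j => rows.map(_(j))).toArray)

  // matrix multiply, R's A %*% B, as a plain Scala operator method
  def %*%(that: Drm): Drm = Drm(
    rows.map { r =>
      that.rows.head.indices.map { j =>
        r.indices.map(i => r(i) * that.rows(i)(j)).sum
      }.toArray
    }
  )
}

object DslDemo extends App {
  val a = Drm(Array(Array(1.0, 2.0), Array(3.0, 4.0)))
  // Gram matrix A' %*% A -- the kind of expression that feeds thin QR / SSVD
  val gram = a.t %*% a
  gram.rows.foreach(r => println(r.mkString(" ")))
}
```

The point of the operator-method spelling is that an algorithm written against such expressions reads like the math, regardless of what actually executes underneath.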
You could think of a programming model similar to R's for large data frames as well. This is not Spark-specific at all. Although, I think I read that there is an upcoming Spark-based project in AMPLab that does mutable data frame manipulations, so chances are we may be able to use it directly in the future.

However, one of the fundamental ideas in the sparkbindings issue is to provide a mere translation layer and not be tightly coupled to Spark. I.e., SSVD could be run on the Stratosphere optimizer without changing a line of its source, by providing physical plan operators for Stratosphere. The same could be done with a data frame API. Obviously, anything coming from AMPLab will be tightly coupled to Spark only. If we stick with the idea of providing just a translation layer, we could think of a programming model that completely decouples the data frame API from Spark dependencies.

Finally, of course, algorithm richness is important too. Like I said, I will probably spend a day or two adding the implicit feedback solver, unless Sebastian wants to pick some of that up. I can definitely add something like k-means and a Gaussian Mixture EM fitter using this API very easily. There is also yet another solver for recommenders, of MCMC nature, that I would like to try, but I think I need to do it for my company first and then release it if they allow it. I can only do so much work for free.

-d
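P.S. A minimal sketch of the "mere translation layer" idea above: logical matrix operators as plain data, with a pluggable per-backend translator. Every name here is hypothetical, invented for illustration; it is not actual sparkbindings code.

```scala
// Hypothetical sketch of a backend-agnostic translation layer: the logical
// plan is plain data, and each engine supplies its own translator.
sealed trait LogicalOp
case class OpLoad(name: String) extends LogicalOp
case class OpTranspose(a: LogicalOp) extends LogicalOp
case class OpTimes(a: LogicalOp, b: LogicalOp) extends LogicalOp

// Each engine (Spark, Stratosphere, ...) would implement this trait with
// real physical operators; the algorithm never sees which one runs it.
trait Backend { def execute(plan: LogicalOp): String }

// A trivial backend that just renders the plan it would execute.
object PrintBackend extends Backend {
  def execute(plan: LogicalOp): String = plan match {
    case OpLoad(n)      => n
    case OpTranspose(x) => s"t(${execute(x)})"
    case OpTimes(x, y)  => s"(${execute(x)} %*% ${execute(y)})"
  }
}

object PlanDemo extends App {
  // the logical plan for A' %*% A, independent of any execution engine
  println(PrintBackend.execute(OpTimes(OpTranspose(OpLoad("A")), OpLoad("A"))))
}
```

Swapping engines then means writing one new `Backend`, not touching any algorithm source, which is exactly the SSVD-on-Stratosphere scenario above.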
