Posting at Pat's request for folks who do not follow the JIRA developments. I probably will need to add all this to the site, although I thought it would be preliminary until it is actually part of an official release.
Anyway, here it is. The relevant JIRA issues are M-1346 and M-1365.

> Also I don't understand the status of your DRM work. Can you do multiply,
> transpose etc. with the code all running on Spark?

Yes, and more. Everything in the PDF is committed and passes functional tests. That includes everything you mentioned, plus SSVD and thin QR. Scalabindings were in the 0.9 release. Sparkbindings is a post-0.9 commit, but everything in the PDF is in the trunk and passes functional unit tests.

Near future: I still haven't committed PCA. There is also an implicit feedback solver that I am allowed to contribute, but I will need to reshape it a little before I do (due to changes in the public version of the sparkbindings).

> How solid is it?

Everything in the PDF passes functional tests. I did not run the SSVD code there at scale, but I had other solvers in the company that did run on an actual cluster. If you can test SSVD at scale, that would be awesome. I am currently on parental leave until 4/7, so I don't have resources to do that. Frankly, I will not have resources to run scale tests on the company servers in the near future either.

> You say it's not production worthy.

I did not say that. Actually, quite the opposite. I said I had some solvers done in this environment internally at my company, and they had no known technical or scale issues, but we did not deploy them because of product priorities. (As it happens, the solver itself is only 10% of the work that needs to happen for the whole product to see production; the remaining 90% requires significant engineering resources, which are not being allocated to this particular feature.)

> How far from it?

I think I answered that.

> Talking about data frame operators does not help those of us who have not
> used Spark yet.

OK, data frames are just a thought at this point. This is all modeled in R's image, not Spark's. And this is only the linear algebra part of it; base R is also + stats and + data frames.
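To make the "modeled in R's image" point concrete, here is a toy, self-contained sketch of what R-like matrix expressions look like in spirit: an in-memory matrix class with `t` and `%*%` operators. The `Drm` class and everything in it is illustrative only, not Mahout's actual DRM code, which runs these operators as distributed plans.

```scala
// Toy illustration of an R-like matrix DSL (NOT Mahout's actual API):
// a small in-memory matrix with transpose and multiply operators.
case class Drm(rows: Array[Array[Double]]) {
  // transpose, R's t(A)
  def t: Drm = Drm(rows.head.indices.map(j => rows.map(_(j))).toArray)

  // matrix multiply, R's A %*% B, as a plain Scala operator method
  def %*%(that: Drm): Drm = Drm(
    rows.map { r =>
      that.rows.head.indices.map { j =>
        r.indices.map(i => r(i) * that.rows(i)(j)).sum
      }.toArray
    }
  )
}

object DslDemo extends App {
  val a = Drm(Array(Array(1.0, 2.0), Array(3.0, 4.0)))
  // Gram matrix A' %*% A -- the kind of expression that feeds thin QR / SSVD
  val gram = a.t %*% a
  gram.rows.foreach(r => println(r.mkString(" ")))
}
```

The point of the operator-method spelling is that an algorithm written against such expressions reads like the math, regardless of what actually executes underneath.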
You could think of a programming model similar to R's for large data frames as well. This is not Spark-specific at all. Although, I think I read that there is an upcoming Spark-based project in AMPLab that does mutable data frame manipulations, so chances are we may be able to use it directly in the future.

However, one of the fundamental ideas in the sparkbindings issue is to provide a mere translation layer and not be tightly coupled to Spark. I.e., SSVD could be run on the Stratosphere optimizer without changing a line of its source, by providing physical plan operators for Stratosphere. The same could be done with a data frame API. Obviously, anything coming from AMPLab will be tightly coupled to Spark only. If we stick with the idea of providing just a translation layer, we could think of a programming model that completely decouples the data frame API from Spark dependencies.

Finally, of course, algorithm richness is important too. Like I said, I will probably spend a day or two adding the implicit feedback solver, unless Sebastian wants to pick some of that up. I can definitely add something like k-means and a Gaussian Mixture EM fitter using this API very easily. There is also yet another solver for recommenders, of MCMC nature, that I would like to try, but I think I need to do it for my company first and then release it if they allow it. I can only do so much work for free.

-d
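P.S. A minimal sketch of the "mere translation layer" idea above: logical matrix operators as plain data, with a pluggable per-backend translator. Every name here is hypothetical, invented for illustration; it is not actual sparkbindings code.

```scala
// Hypothetical sketch of a backend-agnostic translation layer: the logical
// plan is plain data, and each engine supplies its own translator.
sealed trait LogicalOp
case class OpLoad(name: String) extends LogicalOp
case class OpTranspose(a: LogicalOp) extends LogicalOp
case class OpTimes(a: LogicalOp, b: LogicalOp) extends LogicalOp

// Each engine (Spark, Stratosphere, ...) would implement this trait with
// real physical operators; the algorithm never sees which one runs it.
trait Backend { def execute(plan: LogicalOp): String }

// A trivial backend that just renders the plan it would execute.
object PrintBackend extends Backend {
  def execute(plan: LogicalOp): String = plan match {
    case OpLoad(n)      => n
    case OpTranspose(x) => s"t(${execute(x)})"
    case OpTimes(x, y)  => s"(${execute(x)} %*% ${execute(y)})"
  }
}

object PlanDemo extends App {
  // the logical plan for A' %*% A, independent of any execution engine
  println(PrintBackend.execute(OpTimes(OpTranspose(OpLoad("A")), OpLoad("A"))))
}
```

Swapping engines then means writing one new `Backend`, not touching any algorithm source, which is exactly the SSVD-on-Stratosphere scenario above.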
