Spark's DataFrame is obviously not agnostic, and I don't believe there's a good way to abstract it, unfortunately. I think getting too deep into distributed-operation abstraction is a bit dangerous.
I think MLI was one project that attempted to do that -- but it never took off, I guess. Or at least there were zero commits there in something like 18 months, if I am not mistaken, and it never made it into the Spark tree. So it is a good question: if we need a dataframe in Flink, what do we do? I am open to suggestions. I very much don't want to do a "yet another abstract language-integrated Spark SQL" feature. Given our resources, IMO it'd be better to take on fewer goals but make them shine. So I'd do the Spark-based seq2sparse version first, and that would give some ideas on how to create ports/abstractions of that work for Flink.

On Wed, Feb 4, 2015 at 1:51 PM, Andrew Palumbo <ap....@outlook.com> wrote:

>
> On 02/04/2015 03:37 PM, Dmitriy Lyubimov wrote:
>
>> Re: Gokhan's PR post: here are my thoughts, but I did not want to post them there, since they go beyond the scope of that PR's work to chase the root of the issue.
>>
>> on quasi-algebraic methods
>> ========================
>>
>> What is the dilemma here? I don't see any.
>>
>> I already explained that no more than 25% of algorithms are truly 100% algebraic. But about 80% cannot avoid using some algebra, and close to 95% could benefit from using algebra (even stochastic and Monte Carlo stuff).
>>
>> So we are building a system that allows us to cut a developer's work by at least 60% and make that work more readable by 3000%. As far as I am concerned, that fulfills the goal. And I am perfectly happy writing a mix of engine-specific primitives and algebra.
>>
>> That's why I am a bit skeptical about the attempts to abstract non-algebraic primitives, such as the row-wise aggregators in one of the pull requests. Engine-specific primitives and algebra can perfectly co-exist in the guts. And that's how I am doing my stuff in practice, except that I can now skip 80% of the effort on algebra and on bridging incompatible inputs and outputs.
>>
> I am **definitely** not advocating messing with the algebraic optimizer. That was what I saw as the plus side of Gokhan's PR: a separate engine abstraction for quasi-/non-algebraic distributed methods. I didn't comment on the PR either, because admittedly I did not have a chance to spend a lot of time on it. But my quick takeaway was that we could take some very useful and hopefully (close to) ubiquitous distributed operators and pass them through to the engine "guts".
>
> I briefly looked through some of the Flink and h2o code and noticed Flink's AggregateOperator [1] and h2o's MapReduce API [2], and my thought was that we could write pass-through operators for some of the more useful operations in math-scala and then implement them fully in their respective packages. Though I am not sure how this would work in either case w.r.t. partitioning, e.g. on h2o's distributed DataFrame, or on Flink for that matter. Again, I haven't had a lot of time to look at these and see whether this would work at all.
>
> My thought was not to bring primitive engine-specific aggregators, combiners, etc. into math-scala.
>
> I had thought, though, that we were trying to develop a fully engine-agnostic algorithm library on top of the R-like distributed BLAS.
>
> So would the idea be to implement, e.g., seq2sparse fully in the spark module? That would seem to fracture the project a bit.
>
> Or to implement algorithms sequentially where mapBlock() will not suffice, and then optimize them in their respective modules?
>
>> None of that means that R-like algebra cannot be engine agnostic.
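As a concrete illustration of the "mix of engine-specific primitives and algebra" discussed above -- a minimal sketch only, assuming the Mahout 0.10-era math-scala bindings, with drmA as a placeholder input -- the algebra lines go through the engine-agnostic optimizer, while the quasi-algebraic piece drops into mapBlock():

    // Sketch: engine-agnostic algebra mixed with a quasi-algebraic step.
    // Assumes Mahout 0.10-era math-scala bindings; drmA is a placeholder.
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.scalabindings.RLikeOps._

    def normalizedGramian(drmA: DrmLike[Int]): DrmLike[Int] = {

      // Quasi-algebraic step: arbitrary per-block code via mapBlock().
      // Here each row is scaled to unit length (a stand-in for custom
      // logic); the closure runs unchanged on any backend implementing it.
      val drmN = drmA.mapBlock() { case (keys, block) =>
        for (r <- 0 until block.nrow) {
          val norm = block(r, ::).norm(2)
          if (norm > 0) block(r, ::) := block(r, ::) / norm
        }
        keys -> block
      }

      // Fully agnostic algebra on the result: the optimizer picks the
      // physical plan for N' N on whatever engine is bound.
      drmN.t %*% drmN
    }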
>> So people are unhappy about not being able to write the whole thing in a totally agnostic way? And from that they (falsely) infer that the pieces of their work cannot be helped by agnosticism individually, or that the tools are not as good as they might be without backend agnosticism? Sorry, but I fail to see the logic there.
>>
>> We proved algebra can be agnostic. I don't think this notion should be disputed.
>>
>> And even if there were a shred of real benefit in making the algebra tools un-agnostic, it would never outweigh the tons of good we could get for the project by integrating with e.g. the Flink folks. This is one of the points MLLib will never be able to overcome -- being a truly shared ML platform where people can create and share ML, not just a bunch of ad-hoc spaghetti of distributed API calls and Spark-nailed black boxes.
>>
>> Well, yes, methodology implementations will still have native distributed calls. Just not nearly as many as they otherwise would, and they will be much easier to support on another back-end using Strategy patterns. E.g. the implicit feedback problem, which I originally wrote as a quasi-method for Spark only, would've taken just an hour or so to add a strategy for Flink, since it retains all the in-core and distributed algebra work as is.
>>
>> Not to mention the benefit of single-type pipelining.
>>
>> And once we add hardware-accelerated bindings for the in-core stuff, all these methods would immediately benefit from them.
>>
>> On MLLib interoperability issues
>> =========================
>>
>> Well, let me ask you this: what does it mean to be MLLib-interoperable? Is MLLib even interoperable within itself?
>>
>> E.g. I remember the single most frequent request on this list: how can we cluster dimensionally-reduced data?
>>
>> Let's look at what it takes to do this in MLLib. First, we run tf-idf, which produces a collection of vectors (and where did our document ids go? not sure); then we'd have to run svd or pca, both of which accept a RowMatrix (bummer! but we have a collection of vectors); these would produce a RowMatrix as well, but kmeans training takes an RDD of vectors (bummer again!).
>>
>> Not directly pluggable, although semi-trivially or trivially convertible. Plus it strips off information that we potentially already computed earlier in the pipeline, so we'd need to compute it again. I think the problem is well demonstrated.
>>
>> Or, say, the ALS stuff (implicit ALS in particular) is really an algebraic problem. It should be taking input in the form of matrices (which my algebraic feature-extraction pipeline perhaps has just prepared) but it really takes POJOs. Bummer again.
>>
>> So what exactly should we be interoperable with in this picture, if MLLib itself is not consistent?
>>
>> Let's look at the type system in flux there. We have:
>> (1) a collection of vectors,
>> (2) a matrix of known dimensions over a collection of vectors (RowMatrix),
>> (3) IndexedRowMatrix, which is a matrix of known dimensions with keys that can be _only_ longs; and
>> (4) an unknown but not infinitesimal number of POJO-oriented approaches.
>>
>> But OK, let's constrain ourselves to matrix types only.
>>
>> The multitude of matrix types creates problems for tasks that require consistent key propagation (like SVD or PCA or tf-idf, as the MLLib case above demonstrates).
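To make the friction just described concrete, here is roughly what that pipeline looks like against the Spark 1.x MLlib APIs (a sketch; the function and its parameters are placeholder names, not MLLib code). Document ids are gone after the first step, and each seam needs a type conversion:

    import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.rdd.RDD

    def clusterReducedDocs(docs: RDD[Seq[String]], k: Int,
                           numClusters: Int, numIterations: Int): KMeansModel = {
      // tf-idf consumes and produces bare vectors -- any document ids
      // are already lost at this point.
      val tf: RDD[Vector] = new HashingTF().transform(docs)
      val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)

      // PCA wants a RowMatrix, not an RDD[Vector]: convert (seam #1).
      val mat = new RowMatrix(tfidf)
      val pc = mat.computePrincipalComponents(k)   // a small local Matrix
      val reduced: RowMatrix = mat.multiply(pc)    // rows are still unkeyed

      // KMeans wants an RDD[Vector] again: convert back (seam #2).
      KMeans.train(reduced.rows, numClusters, numIterations)
    }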
>> In the aforementioned case of dimensionality reduction over a document collection, there's simply no way to propagate document ids to the rows of the dimensionally-reduced data. As in, none at all. As in a hard, no-workaround-exists stop.
>>
>> So: there's truly no need for multiple incompatible matrix types. There has to be just a single matrix type. Just a flexible one. And everything algebraic needs to use it.
>>
>> And if geometry is needed, then it can be either already known or lazily computed; but if it is not needed, nobody bothers to compute it (i.e. there's truly no need). And this knowledge should not be lost just because we have to convert between types.
>>
>> And if we want to express complex row keys, such as cluster assignments for example (my real case), then we could have a type with keys like Tuple2(rowKeyType, cluster-string).
>>
>> And nobody should really care whether intermediate results are row- or column-partitioned.
>>
>> All within a single type of things.
>>
>> Bottom line, "interoperability" with MLLib is both hard and trivial.
>>
>> Trivial, because whenever you need to convert, it is one line of code and also a trivial distributed map fusion element. (I do have pipelines streaming MLLib methods within DRM-based pipelines; I am not just speculating.)
>>
>> Hard, because there are so many types you may need or want to convert between that there's not much point in even trying to write converters for all possible cases, rather than doing it on a need-to-do basis.
>>
>> It is also hard because their type system obviously keeps evolving as we speak. So there's no point chasing a rabbit still in the making.
>>
>> Epilogue
>> =======
>> There's no problem with the philosophy of the distributed and non-distributed algebra approach. It is incredibly useful in practice, and I have proven that continuously (what is in the public domain is just the tip of the iceberg).
>>
>> Rather, there's organizational anemia in the project. Like corporate legal interests (which include me not being able to do quick turnarounds of fixes), and not having been able to tap into university resources. But I don't believe in any technical philosophy problem.
>>
>> So given the aforementioned resource/logistical anemia, it will likely take some time, and it may seem to get worse before it gets better. But AFAIK there are multiple efforts going on behind the curtains to break the red tape, so I'd just wait a bit.
>>
>
> [1] https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/operators/AggregateOperator.java
> [2] http://h2o-release.s3.amazonaws.com/h2o/rel-lambert/5/docs-website/developuser/java.html
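As a footnote to the "single flexible matrix type" and "one line of code" points above, here is a rough sketch of what those seams look like with the DRM type in the Spark bindings. This assumes the 0.10-era sparkbindings API; toMahoutVector is a hypothetical element-copy helper, not an existing function:

    import org.apache.mahout.math.{DenseVector, Vector => MahoutVector}
    import org.apache.mahout.math.drm.DrmLike
    import org.apache.mahout.sparkbindings._
    import org.apache.spark.rdd.RDD

    // Hypothetical helper: copy an mllib vector into a Mahout vector.
    def toMahoutVector(v: org.apache.spark.mllib.linalg.Vector): MahoutVector =
      new DenseVector(v.toArray)

    // The DRM row key type is free: e.g. String document ids survive the
    // whole algebraic pipeline (and Tuple2 keys would work the same way).
    def wrapKeyed(rows: RDD[(String, MahoutVector)]): DrmLike[String] =
      drmWrap(rows)

    // Converting from an mllib RDD[Vector] is one extra map that fuses
    // into the distributed plan.
    def fromMllib(rows: RDD[org.apache.spark.mllib.linalg.Vector]): DrmLike[Int] =
      drmWrap(rows.zipWithIndex.map { case (v, i) => i.toInt -> toMahoutVector(v) })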