But also keep in mind that Flink folks are eager to allocate resources for ML work. So maybe that's the way to work it -- create a DataFrame-based seq2sparse port and then just hand it off to them to add to either Flink directly (but with DRM output), or as a part of Mahout.
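To make the seq2sparse idea a bit more concrete, below is very roughly what I have in mind for the Spark side. This is only a sketch: it assumes a SparkContext sc in scope, the path, feature size and tokenization are placeholders, and it leans on spark.mllib's HashingTF/IDF plus Mahout's sparkbindings drmWrap; a real port would obviously need proper analyzers, a dictionary and df-count output.

    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.mllib.linalg.{DenseVector, SparseVector}
    import org.apache.mahout.math.{RandomAccessSparseVector, Vector => MahoutVector}
    import org.apache.mahout.sparkbindings._

    // (docId, tokens) -- naive whitespace tokenization stands in for a real analyzer
    val docs = sc.wholeTextFiles("hdfs:///docs").map { case (id, text) =>
      id -> text.toLowerCase.split("\\W+").toSeq
    }

    val numFeatures = 1 << 18
    val hashingTF = new HashingTF(numFeatures)
    val tf = docs.mapValues(toks => hashingTF.transform(toks))   // doc id stays attached
    val idfModel = new IDF(minDocFreq = 2).fit(tf.values)

    // tf-idf, converted row by row into mahout vectors, still keyed by doc id
    val drmRdd = tf.mapValues(v => idfModel.transform(v)).map { case (id, v) =>
      val mv: MahoutVector = new RandomAccessSparseVector(v.size)
      v match {
        case sv: SparseVector => sv.indices.zip(sv.values).foreach { case (i, x) => mv.setQuick(i, x) }
        case dv: DenseVector  => dv.values.zipWithIndex.foreach { case (x, i) => mv.setQuick(i, x) }
      }
      id -> mv
    }

    val drmA = drmWrap(drmRdd, ncol = numFeatures)   // doc-id keyed DRM for the algebra side

Whether this ends up living in the spark module or gets handed to the Flink folks, the part I care about is that the output is a doc-id keyed DRM rather than bare vector files.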
On Wed, Feb 4, 2015 at 2:07 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> Spark's DataFrame is obviously not agnostic.
>
> I don't believe there's a good way to abstract it. Unfortunately. I think getting too much into distributed operation abstraction is a bit dangerous.
>
> I think MLI was one project that attempted to do that -- but it did not take off, I guess. Or at least there were 0 commits in something like 18 months there, if I am not mistaken, and it never made it into the Spark tree.
>
> So it is a good question: if we need a dataframe in Flink, what do we do? I am open to suggestions. I very much don't want to do "yet another abstract language-integrated Spark SQL" feature.
>
> Given resources, IMO it'd be better to take on fewer goals but make them shine. So I'd do a Spark-based seq2sparse version first, and that'd give some ideas on how to create ports/abstractions of that work for Flink.
>
> On Wed, Feb 4, 2015 at 1:51 PM, Andrew Palumbo <ap....@outlook.com> wrote:
>>
>> On 02/04/2015 03:37 PM, Dmitriy Lyubimov wrote:
>>>
>>> Re: Gokhan's PR post: here are my thoughts, but I did not want to post them there since they go beyond the scope of that PR's work to chase the root of the issue.
>>>
>>> On quasi-algebraic methods
>>> ========================
>>>
>>> What is the dilemma here? I don't see any.
>>>
>>> I already explained that no more than 25% of algorithms are truly 100% algebraic. But about 80% cannot avoid using some algebra, and close to 95% could benefit from using algebra (even stochastic and Monte Carlo stuff).
>>>
>>> So we are building a system that allows us to cut a developer's work by at least 60% and make that work more readable by 3000%. As far as I am concerned, that fulfills the goal. And I am perfectly happy writing a mix of engine-specific primitives and algebra.
>>>
>>> That's why I am a bit skeptical about attempts to abstract non-algebraic primitives such as row-wise aggregators in one of the pull requests. Engine-specific primitives and algebra can perfectly co-exist in the guts. And that's how I am doing my stuff in practice, except I can now skip 80% of the effort on algebra and on bridging incompatible inputs and outputs.
>>
>> I am **definitely** not advocating messing with the algebraic optimizer. That was what I saw as the plus side to Gokhan's PR -- a separate engine abstraction for quasi/non-algebraic distributed methods. I didn't comment on the PR either because admittedly I did not have a chance to spend a lot of time on it. But my quick takeaway was that we could take some very useful and hopefully (close to) ubiquitous distributed operators and pass them through to the engine "guts".
>>
>> I briefly looked through some of the Flink and h2o code and noticed Flink's AggregateOperator [1] and h2o's MapReduce API [2], and my thought was that we could write pass-through operators for some of the more useful operations from math-scala and then implement them fully in their respective packages. Though I am not sure how this would work in either case w.r.t. partitioning, e.g. on h2o's distributed DataFrame, or Flink for that matter. Again, I haven't had a lot of time to look at these and see if this would work at all.
>>
>> My thought was not to bring primitive engine-specific aggregators, combiners, etc. into math-scala.
>>
>> I had thought, though, that we were trying to develop a fully engine-agnostic algorithm library on top of the R-like distributed BLAS.
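(Jumping in inline on the pass-through operator idea: the way I picture it is something like the strategy-style sketch below. Purely illustrative -- the trait, the method names and the spark implementation are all invented for the sake of the example, and the real thing would need thought about partitioning as Andrew says. The point is just that math-scala would only ever see the trait, while each engine module keeps its own implementation next to its other guts.)

    import scala.reflect.ClassTag
    import org.apache.mahout.math.Vector
    import org.apache.mahout.math.drm.DrmLike

    // Hypothetical engine-facing hook, living in math-scala: no engine types leak in here.
    trait RowAggregators {
      // fold all rows of a DRM into a single value, e.g. corpus-level statistics
      def aggregateRows[K: ClassTag, U: ClassTag](drm: DrmLike[K])(zero: U)(
          seq: (U, Vector) => U, comb: (U, U) => U): U
    }

    // What the spark-module strategy might look like -- just a thin pass-through to the RDD API.
    class SparkRowAggregators extends RowAggregators {
      import org.apache.mahout.sparkbindings._
      def aggregateRows[K: ClassTag, U: ClassTag](drm: DrmLike[K])(zero: U)(
          seq: (U, Vector) => U, comb: (U, U) => U): U =
        drm.checkpoint().rdd.map(_._2).aggregate(zero)(seq, comb)
    }

A Flink version would presumably do the same thing over its AggregateOperator, and an h2o version over its MapReduce API, without either of them showing up in math-scala.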
>> So would the idea be to implement, e.g., seq2sparse fully in the spark module? It would seem to fracture the project a bit.
>>
>> Or to implement algorithms sequentially if mapBlock() will not suffice, and then optimize them in their respective modules?
>>
>>> None of that means that R-like algebra cannot be engine agnostic. So people are unhappy about not being able to write the whole thing in a totally agnostic way? And so they (falsely) infer that the pieces of their work cannot be helped by agnosticism individually, or that the tools are not as good as they might be without backend agnosticism? Sorry, but I fail to see the logic there.
>>>
>>> We proved algebra can be agnostic. I don't think this notion should be disputed.
>>>
>>> And even if there were a shred of real benefit in making the algebra tools un-agnostic, it would never outweigh the tons of good we could get for the project by integrating with, e.g., the Flink folks. This is one of the points MLLib will never be able to overcome -- being a truly shared ML platform where people can create and share ML, not just a bunch of ad-hoc spaghetti of distributed API calls and Spark-nailed black boxes.
>>>
>>> Well, yes, methodology implementations will still have native distributed calls. Just not nearly as many as they otherwise would, and they will be much easier to support on another back-end using Strategy patterns. E.g. the implicit feedback problem that I originally wrote as a quasi-method for Spark only would've taken just an hour or so to add a strategy for Flink, since it retains all in-core and distributed algebra work as is.
>>>
>>> Not to mention the benefit of single-type pipelining.
>>>
>>> And once we add hardware-accelerated bindings for in-core stuff, all these methods would immediately benefit from it.
>>>
>>> On MLLib interoperability issues
>>> =========================
>>>
>>> Well, let me ask you this: what does it mean to be MLLib-interoperable? Is MLLib even interoperable within itself?
>>>
>>> E.g. I remember one of the most frequent requests on the list here: how can we cluster dimensionally-reduced data?
>>>
>>> Let's look at what it takes to do this in MLLib: first, we run tf-idf, which produces a collection of vectors (and where did our document ids go? not sure); then we'd have to run svd or pca, both of which accept a RowMatrix (bummer! but we have a collection of vectors); which would produce a RowMatrix as well, but kmeans training takes an RDD of vectors (bummer again!).
>>>
>>> Not directly pluggable, although semi-trivially or trivially convertible. Plus it strips off information that we have potentially already computed earlier in the pipeline, so we'd need to compute it again. I think the problem is well demonstrated.
>>>
>>> Or, say, the ALS stuff (implicit ALS in particular) is really an algebraic problem. It should be taking input in the form of matrices (that my feature-extraction algebraic pipeline perhaps has just prepared) but it really takes POJOs. Bummer again.
>>>
>>> So what exactly is it we should be interoperable with in this picture, if MLLib itself is not consistent?
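(Inline again: for anyone who hasn't run into it, the tf-idf -> svd -> kmeans friction described above looks roughly like this in current mllib code. Sketch from memory, so treat the exact calls as approximate; it assumes a SparkContext sc, the input path is made up, and I'm ignoring whether you'd want to scale U by the singular values.)

    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.mllib.clustering.KMeans

    // tf-idf hands back a bare RDD[Vector]; the document ids are already gone at this point
    val docs = sc.textFile("hdfs:///docs").map(_.toLowerCase.split("\\W+").toSeq)
    val tf = new HashingTF().transform(docs)
    val tfidf = new IDF().fit(tf).transform(tf)

    // svd/pca want a RowMatrix, so convert (conversion #1)
    val mat = new RowMatrix(tfidf)
    val svd = mat.computeSVD(20, computeU = true)

    // the reduced data comes back as a RowMatrix, but KMeans.train wants an RDD[Vector]
    // again (conversion #2), and there's still no way to say which row was which document
    val reduced = svd.U.rows
    val model = KMeans.train(reduced, 10, 20)

Each hop is only a line or two, but the row identity is unrecoverable by the time you get to the cluster assignments, which is exactly the problem being described.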
>>> Let's look at the type system in flux there. We have:
>>>
>>> (1) a collection of vectors;
>>> (2) a matrix of known dimensions for a collection of vectors (RowMatrix);
>>> (3) IndexedRowMatrix, which is a matrix of known dimensions with keys that can be _only_ longs; and
>>> (4) an unknown but not insignificant number of POJO-oriented approaches.
>>>
>>> But ok, let's constrain ourselves to matrix types only.
>>>
>>> A multitude of matrix types creates problems for tasks that require consistent key propagation (like SVD or PCA or tf-idf, well demonstrated in the case of mllib). In the aforementioned case of dimensionality reduction over a document collection, there's simply no way to propagate document ids to the rows of the dimensionally-reduced data. As in none at all. As in a hard, no-workaround-exists stop.
>>>
>>> So. There's truly no need for multiple incompatible matrix types. There has to be just a single matrix type. Just a flexible one. And everything algebraic needs to use it.
>>>
>>> And if geometry is needed, then it can be either already known or lazily computed, but if it is not needed, nobody bothers to compute it (i.e., there's truly no need to). And this knowledge should not be lost just because we have to convert between types.
>>>
>>> And if we want to express complex row keys, such as cluster assignments for example (my real case), then we could have a type with keys like Tuple2(rowKeyType, cluster-string).
>>>
>>> And nobody really cares whether intermediate results are row- or column-partitioned.
>>>
>>> All within a single type of thing.
>>>
>>> Bottom line, "interoperability" with mllib is both hard and trivial.
>>>
>>> Trivial, because whenever you need to convert, it is one line of code and also a trivial distributed map fusion element. (I do have pipelines streaming mllib methods within DRM-based pipelines, not just speculating.)
>>>
>>> Hard, because there are so many types you may need/want to convert between that there's not much point in even trying to write converters for all possible cases; better to go on a need-to-do basis.
>>>
>>> It is also hard because their type system obviously continues evolving as we speak. So there's no point chasing a rabbit that is still in the making.
>>>
>>> Epilogue
>>> =======
>>>
>>> There's no problem with the philosophy of the distributed and non-distributed algebra approach. It is incredibly useful in practice, and I have proven it continuously (what is in the public domain is just the tip of the iceberg).
>>>
>>> Rather, there's organizational anemia in the project. Like corporate legal interests (that includes me not being able to do quick turnarounds of fixes), and not having been able to tap into university resources. But I don't believe in any technical-philosophy problem.
>>>
>>> So given that aforementioned resource/logistical anemia, it will likely take some time, and it may seem that it gets worse before it gets better. But afaik there are multiple efforts going on behind the curtains to break the red tape, so I'd just wait a bit.
>>
>> [1] https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/operators/AggregateOperator.java
>> [2] http://h2o-release.s3.amazonaws.com/h2o/rel-lambert/5/docs-website/developuser/java.html
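(One last inline note: with the single keyed DRM type, the same pipeline keeps the document ids attached the whole way through, and dropping down to an mllib RDD when some step insists on it really is the one-line map Dmitriy mentions. Again only a sketch -- it assumes the doc-id keyed drmA from my first sketch above, Samsara's dssvd, and a naive dense conversion on the way out.)

    import org.apache.mahout.math.decompositions._
    import org.apache.mahout.sparkbindings._
    import org.apache.spark.mllib.linalg.Vectors

    // drmA: DrmLike[String], keyed by document id (from the seq2sparse sketch at the top)
    val (drmU, drmV, s) = dssvd(drmA, k = 20)

    // drmU is still keyed by document id -- nothing got lost between tf-idf and the reduced space.
    // If a downstream step wants mllib vectors, that's a one-line map over the backing RDD:
    val mllibRows = drmU.checkpoint().rdd.map { case (docId, v) =>
      docId -> Vectors.dense(Array.tabulate(v.size)(i => v.get(i)))
    }

So whoever ends up doing the Flink or h2o side only has to swap out what sits underneath the DRM; the keyed pipeline above it stays the same.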