Spark's DataFrame is obviously not agnostic, and I don't believe there's a good way to abstract it, unfortunately. I think getting too deep into distributed-operation abstraction is a bit dangerous.
I think MLI was one project that attempted to do that -- but it never took off, I guess. Or at least there were zero commits there in something like 18 months, if I am not mistaken, and it never made it into the Spark tree. So it is a good question: if we need a dataframe in Flink, what do we do? I am open to suggestions. I very much don't want to do a "yet another abstract language-integrated Spark SQL" feature. Given our resources, IMO it'd be better to take on fewer goals but make them shine. So I'd do the Spark-based seq2sparse version first, and that would give some ideas on how to create ports/abstractions of that work for Flink.

On Wed, Feb 4, 2015 at 1:51 PM, Andrew Palumbo <ap....@outlook.com> wrote:

>
> On 02/04/2015 03:37 PM, Dmitriy Lyubimov wrote:
>
>> Re: Gokhan's PR post: here are my thoughts, but I did not want to post them there, since they go beyond the scope of that PR's work to chase the root of the issue.
>>
>> on quasi-algebraic methods
>> ========================
>>
>> What is the dilemma here? I don't see any.
>>
>> I already explained that no more than 25% of algorithms are truly 100% algebraic. But about 80% cannot avoid using some algebra, and close to 95% could benefit from using algebra (even stochastic and Monte Carlo stuff).
>>
>> So we are building a system that allows us to cut a developer's work by at least 60% and make that work more readable by 3000%. As far as I am concerned, that fulfills the goal. And I am perfectly happy writing a mix of engine-specific primitives and algebra.
>>
>> That's why I am a bit skeptical about the attempts to abstract non-algebraic primitives, such as the row-wise aggregators in one of the pull requests. Engine-specific primitives and algebra can perfectly co-exist in the guts. And that's how I am doing my stuff in practice, except that I can now skip 80% of the effort on algebra and on bridging incompatible inputs and outputs.
>>
> I am **definitely** not advocating messing with the algebraic optimizer. That was what I saw as the plus side of Gokhan's PR: a separate engine abstraction for quasi-/non-algebraic distributed methods. I didn't comment on the PR either, because admittedly I did not have a chance to spend a lot of time on it. But my quick takeaway was that we could take some very useful and hopefully (close to) ubiquitous distributed operators and pass them through to the engine "guts".
>
> I briefly looked through some of the Flink and h2o code and noticed Flink's AggregateOperator [1] and h2o's MapReduce API [2], and my thought was that we could write pass-through operators for some of the more useful operations in math-scala and then implement them fully in their respective packages. Though I am not sure how this would work in either case w.r.t. partitioning, e.g. on h2o's distributed DataFrame, or on Flink for that matter. Again, I haven't had a lot of time to look at these and see whether this would work at all.
>
> My thought was not to bring primitive engine-specific aggregators, combiners, etc. into math-scala.
>
> I had thought, though, that we were trying to develop a fully engine-agnostic algorithm library on top of the R-like distributed BLAS.
>
> So would the idea be to implement, e.g., seq2sparse fully in the spark module? That would seem to fracture the project a bit.
>
> Or to implement algorithms sequentially where mapBlock() will not suffice, and then optimize them in their respective modules?
>
>> None of that means that R-like algebra cannot be engine agnostic.
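As a concrete illustration of the "mix of engine-specific primitives and algebra" discussed above -- a minimal sketch only, assuming the Mahout 0.10-era math-scala bindings, with drmA as a placeholder input -- the algebra lines go through the engine-agnostic optimizer, while the quasi-algebraic piece drops into mapBlock():

    // Sketch: engine-agnostic algebra mixed with a quasi-algebraic step.
    // Assumes Mahout 0.10-era math-scala bindings; drmA is a placeholder.
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.scalabindings.RLikeOps._

    def normalizedGramian(drmA: DrmLike[Int]): DrmLike[Int] = {

      // Quasi-algebraic step: arbitrary per-block code via mapBlock().
      // Here each row is scaled to unit length (a stand-in for custom
      // logic); the closure runs unchanged on any backend implementing it.
      val drmN = drmA.mapBlock() { case (keys, block) =>
        for (r <- 0 until block.nrow) {
          val norm = block(r, ::).norm(2)
          if (norm > 0) block(r, ::) := block(r, ::) / norm
        }
        keys -> block
      }

      // Fully agnostic algebra on the result: the optimizer picks the
      // physical plan for N' N on whatever engine is bound.
      drmN.t %*% drmN
    }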
>> So people are unhappy about not being able to write the whole thing in a totally agnostic way? And from that they (falsely) infer that the pieces of their work cannot be helped by agnosticism individually, or that the tools are not as good as they might be without backend agnosticism? Sorry, but I fail to see the logic there.
>>
>> We proved algebra can be agnostic. I don't think this notion should be disputed.
>>
>> And even if there were a shred of real benefit in making the algebra tools un-agnostic, it would never outweigh the tons of good we could get for the project by integrating with e.g. the Flink folks. This is one of the points MLLib will never be able to overcome -- being a truly shared ML platform where people can create and share ML, not just a bunch of ad-hoc spaghetti of distributed API calls and Spark-nailed black boxes.
>>
>> Well, yes, methodology implementations will still have native distributed calls. Just not nearly as many as they otherwise would, and they will be much easier to support on another back-end using Strategy patterns. E.g. the implicit feedback problem, which I originally wrote as a quasi-method for Spark only, would've taken just an hour or so to add a strategy for Flink, since it retains all the in-core and distributed algebra work as is.
>>
>> Not to mention the benefit of single-type pipelining.
>>
>> And once we add hardware-accelerated bindings for the in-core stuff, all these methods would immediately benefit from them.
>>
>> On MLLib interoperability issues
>> =========================
>>
>> Well, let me ask you this: what does it mean to be MLLib-interoperable? Is MLLib even interoperable within itself?
>>
>> E.g. I remember the single most frequent request on this list: how can we cluster dimensionally-reduced data?
>>
>> Let's look at what it takes to do this in MLLib. First, we run tf-idf, which produces a collection of vectors (and where did our document ids go? not sure); then we'd have to run svd or pca, both of which accept a RowMatrix (bummer! but we have a collection of vectors); these would produce a RowMatrix as well, but kmeans training takes an RDD of vectors (bummer again!).
>>
>> Not directly pluggable, although semi-trivially or trivially convertible. Plus it strips off information that we potentially already computed earlier in the pipeline, so we'd need to compute it again. I think the problem is well demonstrated.
>>
>> Or, say, the ALS stuff (implicit ALS in particular) is really an algebraic problem. It should be taking input in the form of matrices (which my algebraic feature-extraction pipeline perhaps has just prepared) but it really takes POJOs. Bummer again.
>>
>> So what exactly should we be interoperable with in this picture, if MLLib itself is not consistent?
>>
>> Let's look at the type system in flux there. We have:
>> (1) a collection of vectors,
>> (2) a matrix of known dimensions over a collection of vectors (RowMatrix),
>> (3) IndexedRowMatrix, which is a matrix of known dimensions with keys that can be _only_ longs; and
>> (4) an unknown but not infinitesimal number of POJO-oriented approaches.
>>
>> But OK, let's constrain ourselves to matrix types only.
>>
>> The multitude of matrix types creates problems for tasks that require consistent key propagation (like SVD or PCA or tf-idf, as the MLLib case above demonstrates).
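To make the friction just described concrete, here is roughly what that pipeline looks like against the Spark 1.x MLlib APIs (a sketch; the function and its parameters are placeholder names, not MLLib code). Document ids are gone after the first step, and each seam needs a type conversion:

    import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.rdd.RDD

    def clusterReducedDocs(docs: RDD[Seq[String]], k: Int,
                           numClusters: Int, numIterations: Int): KMeansModel = {
      // tf-idf consumes and produces bare vectors -- any document ids
      // are already lost at this point.
      val tf: RDD[Vector] = new HashingTF().transform(docs)
      val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)

      // PCA wants a RowMatrix, not an RDD[Vector]: convert (seam #1).
      val mat = new RowMatrix(tfidf)
      val pc = mat.computePrincipalComponents(k)   // a small local Matrix
      val reduced: RowMatrix = mat.multiply(pc)    // rows are still unkeyed

      // KMeans wants an RDD[Vector] again: convert back (seam #2).
      KMeans.train(reduced.rows, numClusters, numIterations)
    }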
>> In the aforementioned case of dimensionality reduction over a document collection, there's simply no way to propagate document ids to the rows of the dimensionally-reduced data. As in, none at all. As in a hard, no-workaround-exists stop.
>>
>> So: there's truly no need for multiple incompatible matrix types. There has to be just a single matrix type. Just a flexible one. And everything algebraic needs to use it.
>>
>> And if geometry is needed, then it can be either already known or lazily computed; but if it is not needed, nobody bothers to compute it (i.e. there's truly no need). And this knowledge should not be lost just because we have to convert between types.
>>
>> And if we want to express complex row keys, such as cluster assignments for example (my real case), then we could have a type with keys like Tuple2(rowKeyType, cluster-string).
>>
>> And nobody should really care whether intermediate results are row- or column-partitioned.
>>
>> All within a single type of things.
>>
>> Bottom line, "interoperability" with MLLib is both hard and trivial.
>>
>> Trivial, because whenever you need to convert, it is one line of code and also a trivial distributed map fusion element. (I do have pipelines streaming MLLib methods within DRM-based pipelines; I am not just speculating.)
>>
>> Hard, because there are so many types you may need or want to convert between that there's not much point in even trying to write converters for all possible cases, rather than doing it on a need-to-do basis.
>>
>> It is also hard because their type system obviously keeps evolving as we speak. So there's no point chasing a rabbit still in the making.
>>
>> Epilogue
>> =======
>> There's no problem with the philosophy of the distributed and non-distributed algebra approach. It is incredibly useful in practice, and I have proven that continuously (what is in the public domain is just the tip of the iceberg).
>>
>> Rather, there's organizational anemia in the project. Like corporate legal interests (which include me not being able to do quick turnarounds of fixes), and not having been able to tap into university resources. But I don't believe in any technical philosophy problem.
>>
>> So given the aforementioned resource/logistical anemia, it will likely take some time, and it may seem to get worse before it gets better. But AFAIK there are multiple efforts going on behind the curtains to break the red tape, so I'd just wait a bit.
>>
>
> [1] https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/operators/AggregateOperator.java
> [2] http://h2o-release.s3.amazonaws.com/h2o/rel-lambert/5/docs-website/developuser/java.html
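As a footnote to the "single flexible matrix type" and "one line of code" points above, here is a rough sketch of what those seams look like with the DRM type in the Spark bindings. This assumes the 0.10-era sparkbindings API; toMahoutVector is a hypothetical element-copy helper, not an existing function:

    import org.apache.mahout.math.{DenseVector, Vector => MahoutVector}
    import org.apache.mahout.math.drm.DrmLike
    import org.apache.mahout.sparkbindings._
    import org.apache.spark.rdd.RDD

    // Hypothetical helper: copy an mllib vector into a Mahout vector.
    def toMahoutVector(v: org.apache.spark.mllib.linalg.Vector): MahoutVector =
      new DenseVector(v.toArray)

    // The DRM row key type is free: e.g. String document ids survive the
    // whole algebraic pipeline (and Tuple2 keys would work the same way).
    def wrapKeyed(rows: RDD[(String, MahoutVector)]): DrmLike[String] =
      drmWrap(rows)

    // Converting from an mllib RDD[Vector] is one extra map that fuses
    // into the distributed plan.
    def fromMllib(rows: RDD[org.apache.spark.mllib.linalg.Vector]): DrmLike[Int] =
      drmWrap(rows.zipWithIndex.map { case (v, i) => i.toInt -> toMahoutVector(v) })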