But also keep in mind that Flink folks are eager to allocate resources for ML work. So maybe that's the way to work it -- create a DataFrame-based seq2sparse port and then just hand it off to them to add to either Flink directly (but with DRM output), or as a part of Mahout.
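To make the seq2sparse idea a bit more concrete, below is very roughly what I have in mind for the Spark side. This is only a sketch: it assumes a SparkContext sc in scope, the path, feature size and tokenization are placeholders, and it leans on spark.mllib's HashingTF/IDF plus Mahout's sparkbindings drmWrap; a real port would obviously need proper analyzers, a dictionary and df-count output.

    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.mllib.linalg.{DenseVector, SparseVector}
    import org.apache.mahout.math.{RandomAccessSparseVector, Vector => MahoutVector}
    import org.apache.mahout.sparkbindings._

    // (docId, tokens) -- naive whitespace tokenization stands in for a real analyzer
    val docs = sc.wholeTextFiles("hdfs:///docs").map { case (id, text) =>
      id -> text.toLowerCase.split("\\W+").toSeq
    }

    val numFeatures = 1 << 18
    val hashingTF = new HashingTF(numFeatures)
    val tf = docs.mapValues(toks => hashingTF.transform(toks))   // doc id stays attached
    val idfModel = new IDF(minDocFreq = 2).fit(tf.values)

    // tf-idf, converted row by row into mahout vectors, still keyed by doc id
    val drmRdd = tf.mapValues(v => idfModel.transform(v)).map { case (id, v) =>
      val mv: MahoutVector = new RandomAccessSparseVector(v.size)
      v match {
        case sv: SparseVector => sv.indices.zip(sv.values).foreach { case (i, x) => mv.setQuick(i, x) }
        case dv: DenseVector  => dv.values.zipWithIndex.foreach { case (x, i) => mv.setQuick(i, x) }
      }
      id -> mv
    }

    val drmA = drmWrap(drmRdd, ncol = numFeatures)   // doc-id keyed DRM for the algebra side

Whether this ends up living in the spark module or gets handed to the Flink folks, the part I care about is that the output is a doc-id keyed DRM rather than bare vector files.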
On Wed, Feb 4, 2015 at 2:07 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> Spark's DataFrame is obviously not agnostic.
>
> I don't believe there's a good way to abstract it. Unfortunately. I think getting too much into distributed operation abstraction is a bit dangerous.
>
> I think MLI was one project that attempted to do that -- but it did not take off, I guess. Or at least there were 0 commits in something like 18 months there, if I am not mistaken, and it never made it into the Spark tree.
>
> So it is a good question: if we need a dataframe in Flink, what do we do? I am open to suggestions. I very much don't want to do "yet another abstract language-integrated Spark SQL" feature.
>
> Given resources, IMO it'd be better to take on fewer goals but make them shine. So I'd do a Spark-based seq2sparse version first, and that'd give some ideas on how to create ports/abstractions of that work for Flink.
>
> On Wed, Feb 4, 2015 at 1:51 PM, Andrew Palumbo <ap....@outlook.com> wrote:
>>
>> On 02/04/2015 03:37 PM, Dmitriy Lyubimov wrote:
>>>
>>> Re: Gokhan's PR post: here are my thoughts, but I did not want to post them there since they go beyond the scope of that PR's work to chase the root of the issue.
>>>
>>> On quasi-algebraic methods
>>> ========================
>>>
>>> What is the dilemma here? I don't see any.
>>>
>>> I already explained that no more than 25% of algorithms are truly 100% algebraic. But about 80% cannot avoid using some algebra, and close to 95% could benefit from using algebra (even stochastic and Monte Carlo stuff).
>>>
>>> So we are building a system that allows us to cut a developer's work by at least 60% and make that work more readable by 3000%. As far as I am concerned, that fulfills the goal. And I am perfectly happy writing a mix of engine-specific primitives and algebra.
>>>
>>> That's why I am a bit skeptical about attempts to abstract non-algebraic primitives such as row-wise aggregators in one of the pull requests. Engine-specific primitives and algebra can perfectly co-exist in the guts. And that's how I am doing my stuff in practice, except I can now skip 80% of the effort on algebra and on bridging incompatible inputs and outputs.
>>
>> I am **definitely** not advocating messing with the algebraic optimizer. That was what I saw as the plus side to Gokhan's PR -- a separate engine abstraction for quasi/non-algebraic distributed methods. I didn't comment on the PR either because admittedly I did not have a chance to spend a lot of time on it. But my quick takeaway was that we could take some very useful and hopefully (close to) ubiquitous distributed operators and pass them through to the engine "guts".
>>
>> I briefly looked through some of the Flink and h2o code and noticed Flink's AggregateOperator [1] and h2o's MapReduce API [2], and my thought was that we could write pass-through operators for some of the more useful operations from math-scala and then implement them fully in their respective packages. Though I am not sure how this would work in either case w.r.t. partitioning, e.g. on h2o's distributed DataFrame, or Flink for that matter. Again, I haven't had a lot of time to look at these and see if this would work at all.
>>
>> My thought was not to bring primitive engine-specific aggregators, combiners, etc. into math-scala.
>>
>> I had thought, though, that we were trying to develop a fully engine-agnostic algorithm library on top of the R-like distributed BLAS.
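(Jumping in inline on the pass-through operator idea: the way I picture it is something like the strategy-style sketch below. Purely illustrative -- the trait, the method names and the spark implementation are all invented for the sake of the example, and the real thing would need thought about partitioning as Andrew says. The point is just that math-scala would only ever see the trait, while each engine module keeps its own implementation next to its other guts.)

    import scala.reflect.ClassTag
    import org.apache.mahout.math.Vector
    import org.apache.mahout.math.drm.DrmLike

    // Hypothetical engine-facing hook, living in math-scala: no engine types leak in here.
    trait RowAggregators {
      // fold all rows of a DRM into a single value, e.g. corpus-level statistics
      def aggregateRows[K: ClassTag, U: ClassTag](drm: DrmLike[K])(zero: U)(
          seq: (U, Vector) => U, comb: (U, U) => U): U
    }

    // What the spark-module strategy might look like -- just a thin pass-through to the RDD API.
    class SparkRowAggregators extends RowAggregators {
      import org.apache.mahout.sparkbindings._
      def aggregateRows[K: ClassTag, U: ClassTag](drm: DrmLike[K])(zero: U)(
          seq: (U, Vector) => U, comb: (U, U) => U): U =
        drm.checkpoint().rdd.map(_._2).aggregate(zero)(seq, comb)
    }

A Flink version would presumably do the same thing over its AggregateOperator, and an h2o version over its MapReduce API, without either of them showing up in math-scala.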
>> So would the idea be to implement, e.g., seq2sparse fully in the spark module? It would seem to fracture the project a bit.
>>
>> Or to implement algorithms sequentially if mapBlock() will not suffice, and then optimize them in their respective modules?
>>
>>> None of that means that R-like algebra cannot be engine agnostic. So people are unhappy about not being able to write the whole thing in a totally agnostic way? And so they (falsely) infer that the pieces of their work cannot be helped by agnosticism individually, or that the tools are not as good as they might be without backend agnosticism? Sorry, but I fail to see the logic there.
>>>
>>> We proved algebra can be agnostic. I don't think this notion should be disputed.
>>>
>>> And even if there were a shred of real benefit in making the algebra tools un-agnostic, it would never outweigh the tons of good we could get for the project by integrating with, e.g., the Flink folks. This is one of the points MLLib will never be able to overcome -- being a truly shared ML platform where people can create and share ML, not just a bunch of ad-hoc spaghetti of distributed API calls and Spark-nailed black boxes.
>>>
>>> Well, yes, methodology implementations will still have native distributed calls. Just not nearly as many as they otherwise would, and they will be much easier to support on another back-end using Strategy patterns. E.g. the implicit feedback problem that I originally wrote as a quasi-method for Spark only would've taken just an hour or so to add a strategy for Flink, since it retains all in-core and distributed algebra work as is.
>>>
>>> Not to mention the benefit of single-type pipelining.
>>>
>>> And once we add hardware-accelerated bindings for in-core stuff, all these methods would immediately benefit from it.
>>>
>>> On MLLib interoperability issues
>>> =========================
>>>
>>> Well, let me ask you this: what does it mean to be MLLib-interoperable? Is MLLib even interoperable within itself?
>>>
>>> E.g. I remember one of the most frequent requests on the list here: how can we cluster dimensionally-reduced data?
>>>
>>> Let's look at what it takes to do this in MLLib: first, we run tf-idf, which produces a collection of vectors (and where did our document ids go? not sure); then we'd have to run svd or pca, both of which accept a RowMatrix (bummer! but we have a collection of vectors); which would produce a RowMatrix as well, but kmeans training takes an RDD of vectors (bummer again!).
>>>
>>> Not directly pluggable, although semi-trivially or trivially convertible. Plus it strips off information that we have potentially already computed earlier in the pipeline, so we'd need to compute it again. I think the problem is well demonstrated.
>>>
>>> Or, say, the ALS stuff (implicit ALS in particular) is really an algebraic problem. It should be taking input in the form of matrices (that my feature-extraction algebraic pipeline perhaps has just prepared) but it really takes POJOs. Bummer again.
>>>
>>> So what exactly is it we should be interoperable with in this picture, if MLLib itself is not consistent?
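(Inline again: for anyone who hasn't run into it, the tf-idf -> svd -> kmeans friction described above looks roughly like this in current mllib code. Sketch from memory, so treat the exact calls as approximate; it assumes a SparkContext sc, the input path is made up, and I'm ignoring whether you'd want to scale U by the singular values.)

    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.mllib.clustering.KMeans

    // tf-idf hands back a bare RDD[Vector]; the document ids are already gone at this point
    val docs = sc.textFile("hdfs:///docs").map(_.toLowerCase.split("\\W+").toSeq)
    val tf = new HashingTF().transform(docs)
    val tfidf = new IDF().fit(tf).transform(tf)

    // svd/pca want a RowMatrix, so convert (conversion #1)
    val mat = new RowMatrix(tfidf)
    val svd = mat.computeSVD(20, computeU = true)

    // the reduced data comes back as a RowMatrix, but KMeans.train wants an RDD[Vector]
    // again (conversion #2), and there's still no way to say which row was which document
    val reduced = svd.U.rows
    val model = KMeans.train(reduced, 10, 20)

Each hop is only a line or two, but the row identity is unrecoverable by the time you get to the cluster assignments, which is exactly the problem being described.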
>>> Let's look at the type system in flux there. We have:
>>>
>>> (1) a collection of vectors;
>>> (2) a matrix of known dimensions for a collection of vectors (RowMatrix);
>>> (3) IndexedRowMatrix, which is a matrix of known dimensions with keys that can be _only_ longs; and
>>> (4) an unknown but not insignificant number of POJO-oriented approaches.
>>>
>>> But ok, let's constrain ourselves to matrix types only.
>>>
>>> A multitude of matrix types creates problems for tasks that require consistent key propagation (like SVD or PCA or tf-idf, well demonstrated in the case of mllib). In the aforementioned case of dimensionality reduction over a document collection, there's simply no way to propagate document ids to the rows of the dimensionally-reduced data. As in none at all. As in a hard, no-workaround-exists stop.
>>>
>>> So. There's truly no need for multiple incompatible matrix types. There has to be just a single matrix type. Just a flexible one. And everything algebraic needs to use it.
>>>
>>> And if geometry is needed, then it can be either already known or lazily computed, but if it is not needed, nobody bothers to compute it (i.e., there's truly no need to). And this knowledge should not be lost just because we have to convert between types.
>>>
>>> And if we want to express complex row keys, such as cluster assignments for example (my real case), then we could have a type with keys like Tuple2(rowKeyType, cluster-string).
>>>
>>> And nobody really cares whether intermediate results are row- or column-partitioned.
>>>
>>> All within a single type of thing.
>>>
>>> Bottom line, "interoperability" with mllib is both hard and trivial.
>>>
>>> Trivial, because whenever you need to convert, it is one line of code and also a trivial distributed map fusion element. (I do have pipelines streaming mllib methods within DRM-based pipelines, not just speculating.)
>>>
>>> Hard, because there are so many types you may need/want to convert between that there's not much point in even trying to write converters for all possible cases; better to go on a need-to-do basis.
>>>
>>> It is also hard because their type system obviously continues evolving as we speak. So there's no point chasing a rabbit that is still in the making.
>>>
>>> Epilogue
>>> =======
>>>
>>> There's no problem with the philosophy of the distributed and non-distributed algebra approach. It is incredibly useful in practice, and I have proven it continuously (what is in the public domain is just the tip of the iceberg).
>>>
>>> Rather, there's organizational anemia in the project. Like corporate legal interests (that includes me not being able to do quick turnarounds of fixes), and not having been able to tap into university resources. But I don't believe in any technical-philosophy problem.
>>>
>>> So given that aforementioned resource/logistical anemia, it will likely take some time, and it may seem that it gets worse before it gets better. But afaik there are multiple efforts going on behind the curtains to break the red tape, so I'd just wait a bit.
>>
>> [1] https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/operators/AggregateOperator.java
>> [2] http://h2o-release.s3.amazonaws.com/h2o/rel-lambert/5/docs-website/developuser/java.html
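(One last inline note: with the single keyed DRM type, the same pipeline keeps the document ids attached the whole way through, and dropping down to an mllib RDD when some step insists on it really is the one-line map Dmitriy mentions. Again only a sketch -- it assumes the doc-id keyed drmA from my first sketch above, Samsara's dssvd, and a naive dense conversion on the way out.)

    import org.apache.mahout.math.decompositions._
    import org.apache.mahout.sparkbindings._
    import org.apache.spark.mllib.linalg.Vectors

    // drmA: DrmLike[String], keyed by document id (from the seq2sparse sketch at the top)
    val (drmU, drmV, s) = dssvd(drmA, k = 20)

    // drmU is still keyed by document id -- nothing got lost between tf-idf and the reduced space.
    // If a downstream step wants mllib vectors, that's a one-line map over the backing RDD:
    val mllibRows = drmU.checkpoint().rdd.map { case (docId, v) =>
      docId -> Vectors.dense(Array.tabulate(v.size)(i => v.get(i)))
    }

So whoever ends up doing the Flink or h2o side only has to swap out what sits underneath the DRM; the keyed pipeline above it stays the same.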