For reasons of transparency in this discussion, I should add that I am a
committer on the upcoming Stratosphere ASF podling, a co-worker of its
main developers, and have contributed to it as part of my PhD.
On 04/29/2014 09:23 PM, Sebastian Schelter wrote:
Anand,
I'm trying to answer some of your questions, and my answers highlight
the points that I would like to see clarified about h2o.
On 04/28/2014 11:13 PM, Anand Avati wrote:
1. Why is the DSL claiming to have (in its vision) logical vs physical
separation if not for providing multiple compute backends?
This is not a claim or a vision; the DSL already has this separation.
Take, for example, o.a.m.sparkbindings.drm.plan.OpAtA, which is the logical
operator for a transpose-times-self matrix multiplication (A' * A). In
o.a.m.sparkbindings.blas.AtA you will find two physical operator
implementations for it. The choice of which one to use depends on whether
there is enough memory to hold certain intermediate results.
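To illustrate that kind of choice, here is a rough, self-contained Scala
sketch; the names (the OpAtA field, AtAInMemory, AtADistributed,
choosePhysical) are illustrative stand-ins and not the actual Mahout code:

    // Hypothetical sketch of a logical-to-physical operator choice.
    object AtAChoiceSketch {
      // Logical operator: "compute A' * A" over a matrix with ncol columns.
      case class OpAtA(ncol: Int)

      sealed trait PhysicalOp
      case object AtAInMemory extends PhysicalOp     // result fits in memory
      case object AtADistributed extends PhysicalOp  // result stays distributed

      def choosePhysical(op: OpAtA, maxInMemoryBytes: Long): PhysicalOp = {
        // A' * A is ncol x ncol, with 8 bytes per double entry.
        val resultBytes = op.ncol.toLong * op.ncol * 8
        if (resultBytes <= maxInMemoryBytes) AtAInMemory else AtADistributed
      }
    }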
The primary intention of separating logical and physical operators is to
allow for a declarative programming style on the user's side, and for an
optimizer on the system side that automatically chooses the best physical
operator for executing a specific program. This choice of the physical
operator might depend on the shape and amount of the data processed, as
well as on the underlying available resources. *The separation into
logical and physical operators clearly does not imply having multiple
backends*. It only makes it very easy to support them.
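To make the user-facing side concrete, a minimal sketch in the Scala DSL of
the spark bindings (written from memory, so the exact imports, package names
and entry points may differ slightly):

    // Assumes an implicit Mahout distributed context (a wrapped SparkContext)
    // is in scope.
    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.sparkbindings._
    import org.apache.mahout.sparkbindings.drm._
    import org.apache.mahout.sparkbindings.drm.RLikeDrmOps._

    val A = drmParallelize(dense((1, 2), (3, 4), (5, 6)))
    val C = A.t %*% A      // only builds a logical plan containing OpAtA
    val result = C.collect // optimization and execution happen here

The user never names a physical operator; the optimizer picks one of the
AtA implementations when the plan is executed.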
2. Does the proposal of having a new DSL backend in the future (for e.g
stratosphere as suggested elsewhere) make you:
-- worry that stratosphere would be a dependency to Mahout?
Stratosphere has recently been accepted as an incubator project at the
ASF, so the worry about such a dependency is naturally smaller than about
an externally managed project like h2o.
-- worry that as a user/committer/contributor you have to worry about a new
framework?
In my eyes, there is a big difference between Spark/Stratosphere and
h2o. Spark and Stratosphere have a clearly defined programming and
execution model: they execute programs that are composed of a DAG of
operators, and the set of operators has clearly defined semantics and
parallelization strategies. If you compare their operators, you will
find that they offer pretty much the same thing in slightly different
flavors. For both systems, there are scientific papers that explain all
of this in detail.
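As a small illustration of what "a DAG of operators" means in practice, here
is a standard Spark word count in Scala (plain Spark, not Mahout code); each
transformation adds an operator to the DAG, and nothing runs until the final
action:

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCountDag {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("dag-example"))
        sc.textFile(args(0))                 // read input lines
          .flatMap(_.split("\\s+"))          // split lines into words
          .map(word => (word, 1))            // emit (word, 1) pairs
          .reduceByKey(_ + _)                // sum counts per word
          .saveAsTextFile(args(1))           // action: triggers execution of the DAG
        sc.stop()
      }
    }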
I have asked for a detailed description of h2o's programming and execution
model and I have searched the documentation, but I haven't been able to
find anything that clearly describes how things are done. I would love to
read up on this, but until I am presented with such a description, I have
to assume that this principled foundation is missing.
--sebastian