At the risk of stepping into a maelstrom where I don't belong, let me
answer some of these:
On 4/27/2014 2:42 PM, Dmitriy Lyubimov (JIRA) wrote:
> *(C)*
> ...because x2o programming model is not rich enough to provide things like
> zipping identically distributed datasets,
We do, and it's "free" - a pointer-copy only. Distribution in H2O is
called a "VectorGroup", and 2 Vecs in the same VectorGroup will have
equal distribution. Zipping them is as easy as: "new Vec[]{vec1,vec2}".
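To make the "zipping is a pointer-copy" point concrete, here is a minimal sketch in plain Java - hypothetical code, not the actual H2O `Vec`/`VectorGroup` API - showing that two vectors with identical chunk boundaries can be walked in lockstep with no data movement:

```java
// Hypothetical sketch (NOT the H2O API): two "vectors" that share the
// same chunking -- the analogue of two Vecs in one VectorGroup -- can
// be "zipped" simply by pairing chunk i of one with chunk i of the other.
final class ZipSketch {
    // Split data into fixed-size chunks; equal distribution means two
    // vectors built this way have chunk boundaries that line up exactly.
    static double[][] chunksOf(double[] data, int chunkSize) {
        int n = (data.length + chunkSize - 1) / chunkSize;
        double[][] chunks = new double[n][];
        for (int i = 0; i < n; i++) {
            int lo = i * chunkSize, hi = Math.min(lo + chunkSize, data.length);
            chunks[i] = java.util.Arrays.copyOfRange(data, lo, hi);
        }
        return chunks;
    }

    // "Zipping" is just iterating the aligned chunk pairs -- the
    // per-chunk analogue of new Vec[]{vec1, vec2}; here we add
    // element-wise to show the pairing.
    static double[] zipSum(double[][] v1, double[][] v2) {
        java.util.List<Double> out = new java.util.ArrayList<>();
        for (int c = 0; c < v1.length; c++)
            for (int i = 0; i < v1[c].length; i++)
                out.add(v1[c][i] + v2[c][i]);
        double[] res = new double[out.size()];
        for (int i = 0; i < res.length; i++) res[i] = out.get(i);
        return res;
    }
}
```

For example, `zipSum(chunksOf({1,2,3}, 2), chunksOf({10,20,30}, 2))` pairs the aligned chunks `[1,2]/[10,20]` and `[3]/[30]` without any reshuffling.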
> very general shuffle model (e.g. many-to-many shuffle),
Again, we do - although by its very design this operation mode is
expensive for everybody - it implies at least O(n) general
communication, sometimes O(n^2).
We try to provide tools to allow people to avoid general shuffles, but
if they want it - it's easily available.
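A hedged illustration of why a general shuffle is inherently expensive - this is a toy sketch in plain Java (the node/record representation is invented for illustration, not any H2O structure): every node may send records to every other node, so the data crosses the wire once, O(n), and with p nodes up to p*p channels are in play:

```java
// Toy many-to-many shuffle (illustrative only, not H2O code):
// hash-partition every record on every source node to a target node.
// Each of the n records moves once, and any source may talk to any
// target -- the all-to-all communication pattern described above.
final class ShuffleSketch {
    static java.util.List<java.util.List<Integer>> shuffle(int[][] perNode, int nNodes) {
        java.util.List<java.util.List<Integer>> out = new java.util.ArrayList<>();
        for (int i = 0; i < nNodes; i++) out.add(new java.util.ArrayList<>());
        for (int[] sourceNode : perNode)
            for (int rec : sourceNode)
                out.get(Math.floorMod(rec, nNodes)).add(rec); // hash -> target
        return out;
    }
}
```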
> advanced partition management (shuffless resplit-coalesce), and so on.
We are trying very hard to do good partition management "under the hood"
and never expose it.
If we have to expose the partitions, then I think this is a sign of a
broken API - although I'm willing to be convinced otherwise given some
cases where a user-rolled partition hack beats what we're doing "under
the hood". So far we've seen only one such case (forced random shuffle
of chunk-size granularity), and we're folding it back into the basic engine.
> I am not even sure if there's a clear concept of combiner type operation.
If by "combiner type" operation, you mean what Mahout calls
"aggregations" - then of course yes we totally support aggregations -
our Map/Reduce paradigm is exactly an aggregation implemented with
generic Java code.
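The map/reduce-as-aggregation pattern can be sketched in plain Java - a hypothetical stand-in, not the real H2O MRTask class, with `map`/`reduce`/`doAll` names borrowed only for the shape of the pattern:

```java
// Minimal sketch of a combiner-style aggregation (hypothetical, not the
// H2O MRTask API): map() builds a partial result per chunk, reduce()
// folds two partials together, and the fold is order-independent --
// which is exactly what makes it a "combiner" in the Hadoop sense.
final class SumTask {
    double _sum;                 // the aggregate under construction

    void map(double[] chunk) {   // runs once per local chunk
        for (double d : chunk) _sum += d;
    }

    void reduce(SumTask other) { // combines another task's partial
        _sum += other._sum;
    }

    static double doAll(double[][] chunks) {
        SumTask total = new SumTask();
        for (double[] c : chunks) {
            SumTask t = new SumTask();
            t.map(c);            // per-chunk partial sum
            total.reduce(t);     // fold partials; order doesn't matter
        }
        return total._sum;
    }
}
```

Because `reduce` is associative and commutative here, the partials can be combined in any order across chunks or nodes.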
Cliff