Hi!

To follow up on what Ufuk explained:
- Ufuk is right, the problem is not getting the data set.
  https://github.com/apache/flink/pull/210 does that for anything that is
  not too gigantic, which is a good start. I think we should merge this as
  soon as we agree on the signature and names of the API methods. We can
  swap the internal realization for something more robust later.

- For anything that just issues a program and wants the result back, this
  is actually perfectly fine.

- For true interactive programs, we need to back-track to intermediate
  results (rather than to the source) to avoid re-executing large parts.
  This is the biggest missing piece, next to the persistent materialization
  of intermediate results (Ufuk is working on this). The logic is the same
  as for fault tolerance, so it is part of that development.

  @Alexander: I want to create the feature branch for that on Thursday.
  Are you interested in contributing to that feature?

- For streaming results back continuously, we need another mechanism than
  the accumulators. Let's create a design doc or thread and get working on
  that. It probably involves adding another set of Akka messages from
  TM -> JM -> Client, or something like an extension to the BLOB manager
  for streams.

Greetings,
Stephan

On Mon, Jan 12, 2015 at 12:25 PM, Alexander Alexandrov <
alexander.s.alexand...@gmail.com> wrote:

> Thanks, I am currently looking at the new ExecutionEnvironment API.
>
> > I think Stephan is working on the scheduling to support this kind of
> > programs.
>
> @Stephan: is there a feature branch for that somewhere?
>
> 2015-01-12 12:05 GMT+01:00 Ufuk Celebi <u...@apache.org>:
>
> > Hey Alexander,
> >
> > On 12 Jan 2015, at 11:42, Alexander Alexandrov <
> > alexander.s.alexand...@gmail.com> wrote:
> >
> > > Hi there,
> > >
> > > I wished for intermediate datasets, and Santa Ufuk made my wishes
> > > come true (thank you, Santa)!
> > >
> > > Now that FLINK-986 is in the mainline, I want to ask some practical
> > > questions.
> > > In Spark, there is a way to put a value from the local driver to the
> > > distributed runtime via
> > >
> > > val x = env.parallelize(...) // expose x to the distributed runtime
> > > val y = dataflow(env, x) // y is produced by a dataflow which reads
> > > from x
> > >
> > > and also to get a value from the distributed runtime back to the
> > > driver
> > >
> > > val z = env.collect("y")
> > >
> > > As far as I know, in Flink we have an equivalent for parallelize
> > >
> > > val x = env.fromCollection(...)
> > >
> > > but not for collect. Is this still the case?
> >
> > Yes, but this will change soon.
> >
> > > If yes, how hard would it be to add this feature at the moment? Can
> > > you give me some pointers?
> >
> > There is an "alpha" version/hack of this using accumulators. See
> > https://github.com/apache/flink/pull/210. The problem is that each
> > collect call results in a new program being executed from the sources.
> > I think Stephan is working on the scheduling to support this kind of
> > programs. From the runtime perspective, it is not a problem to
> > transfer the produced intermediate results back to the job manager.
> > The job manager can basically use the same mechanism that the task
> > managers use. Even the accumulator version should be fine as an
> > initial version.
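[Editor's note: to make the accumulator-based collect() idea discussed above concrete, here is a small, self-contained Scala sketch. It does not use Flink's actual classes; the `ListAccumulator` name and its methods are invented for illustration. The idea from the thread is that each parallel task appends its serialized records to an accumulator, the per-task accumulators are merged at the job manager, and the client deserializes the merged list.]

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical local model of an accumulator-based collect():
// each parallel task serializes the records it sees into a byte-array list,
// the lists are merged, and the client deserializes the result.
class ListAccumulator[T <: java.io.Serializable] {
  private val buffer = scala.collection.mutable.ListBuffer.empty[Array[Byte]]

  // Called on the task side for every record produced by the sink.
  def add(value: T): Unit = {
    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    oos.writeObject(value)
    oos.close()
    buffer += bos.toByteArray
  }

  // Called when per-task accumulators are combined after the job finishes.
  def merge(other: ListAccumulator[T]): Unit = buffer ++= other.buffer

  // Called on the client side to materialize the collected records.
  def localValue: List[T] = buffer.toList.map { bytes =>
    new ObjectInputStream(new ByteArrayInputStream(bytes)).readObject().asInstanceOf[T]
  }
}

// Simulate two parallel tasks each contributing records.
val task1 = new ListAccumulator[String]
val task2 = new ListAccumulator[String]
Seq("a", "b").foreach(task1.add)
Seq("c", "d").foreach(task2.add)

// Merge and read back on the "client".
task1.merge(task2)
println(task1.localValue) // prints List(a, b, c, d)
```

The per-record serialization here is what makes the mechanism work only "for anything that is not too gigantic", as noted above: all collected bytes end up in client-side memory.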