Hi!

To follow up on what Ufuk explained:
- Ufuk is right, the problem is not getting the data set.
  https://github.com/apache/flink/pull/210 does that for anything that is
  not too gigantic, which is a good start. I think we should merge this as
  soon as we agree on the signature and names of the API methods. We can
  swap the internal realization for something more robust later.

- For anything that just issues a program and wants the result back, this
  is actually perfectly fine.

- For true interactive programs, we need to back-track to intermediate
  results (rather than to the source) to avoid re-executing large parts.
  This is the biggest missing piece, next to the persistent materialization
  of intermediate results (Ufuk is working on this). The logic is the same
  as for fault tolerance, so it is part of that development.

  @Alexander: I want to create the feature branch for that on Thursday.
  Are you interested in contributing to that feature?

- For streaming results back continuously, we need another mechanism than
  the accumulators. Let's create a design doc or thread and get working on
  that. It probably involves adding another set of Akka messages from
  TM -> JM -> Client, or something like an extension to the BLOB manager
  for streams.

Greetings,
Stephan

On Mon, Jan 12, 2015 at 12:25 PM, Alexander Alexandrov <
alexander.s.alexand...@gmail.com> wrote:

> Thanks, I am currently looking at the new ExecutionEnvironment API.
>
> > I think Stephan is working on the scheduling to support this kind of
> > programs.
>
> @Stephan: is there a feature branch for that somewhere?
>
> 2015-01-12 12:05 GMT+01:00 Ufuk Celebi <u...@apache.org>:
>
> > Hey Alexander,
> >
> > On 12 Jan 2015, at 11:42, Alexander Alexandrov <
> > alexander.s.alexand...@gmail.com> wrote:
> >
> > > Hi there,
> > >
> > > I wished for intermediate datasets, and Santa Ufuk made my wishes
> > > come true (thank you, Santa)!
> > >
> > > Now that FLINK-986 is in the mainline, I want to ask some practical
> > > questions.
> > > In Spark, there is a way to put a value from the local driver to the
> > > distributed runtime via
> > >
> > > val x = env.parallelize(...) // expose x to the distributed runtime
> > > val y = dataflow(env, x) // y is produced by a dataflow which reads
> > > from x
> > >
> > > and also to get a value from the distributed runtime back to the
> > > driver
> > >
> > > val z = env.collect("y")
> > >
> > > As far as I know, in Flink we have an equivalent for parallelize
> > >
> > > val x = env.fromCollection(...)
> > >
> > > but not for collect. Is this still the case?
> >
> > Yes, but this will change soon.
> >
> > > If yes, how hard would it be to add this feature at the moment? Can
> > > you give me some pointers?
> >
> > There is an "alpha" version/hack of this using accumulators. See
> > https://github.com/apache/flink/pull/210. The problem is that each
> > collect call results in a new program being executed from the sources.
> > I think Stephan is working on the scheduling to support this kind of
> > programs. From the runtime perspective, it is not a problem to
> > transfer the produced intermediate results back to the job manager.
> > The job manager can basically use the same mechanism that the task
> > managers use. Even the accumulator version should be fine as an
> > initial version.
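[Editor's note: to make the accumulator-based collect() idea discussed above concrete, here is a small, self-contained Scala sketch. It does not use Flink's actual classes; the `ListAccumulator` name and its methods are invented for illustration. The idea from the thread is that each parallel task appends its serialized records to an accumulator, the per-task accumulators are merged at the job manager, and the client deserializes the merged list.]

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical local model of an accumulator-based collect():
// each parallel task serializes the records it sees into a byte-array list,
// the lists are merged, and the client deserializes the result.
class ListAccumulator[T <: java.io.Serializable] {
  private val buffer = scala.collection.mutable.ListBuffer.empty[Array[Byte]]

  // Called on the task side for every record produced by the sink.
  def add(value: T): Unit = {
    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    oos.writeObject(value)
    oos.close()
    buffer += bos.toByteArray
  }

  // Called when per-task accumulators are combined after the job finishes.
  def merge(other: ListAccumulator[T]): Unit = buffer ++= other.buffer

  // Called on the client side to materialize the collected records.
  def localValue: List[T] = buffer.toList.map { bytes =>
    new ObjectInputStream(new ByteArrayInputStream(bytes)).readObject().asInstanceOf[T]
  }
}

// Simulate two parallel tasks each contributing records.
val task1 = new ListAccumulator[String]
val task2 = new ListAccumulator[String]
Seq("a", "b").foreach(task1.add)
Seq("c", "d").foreach(task2.add)

// Merge and read back on the "client".
task1.merge(task2)
println(task1.localValue) // prints List(a, b, c, d)
```

The per-record serialization here is what makes the mechanism work only "for anything that is not too gigantic", as noted above: all collected bytes end up in client-side memory.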