This is a difficult question. A program might also later refer to some intermediate data set that would have been already computed if sinks are executed together with the count() call and need to be computed again. Also what do we do with sinks that are not connected with the collected or counted data set. Executing them as well would not give any benefit but might result in higher costs, if something on their path is used later. On the other hand distinguishing between connected and unconnected sinks might not be easy to reason about.
How about caching data sets where the data flow branches? In any case, the policy should be easy to understand for users. Cheers, Fabian 2015-01-19 11:19 GMT+01:00 Ufuk Celebi <u...@apache.org>: > I think this question depends on how much both subgraphs overlap? But in > general, I agree that the first approach seems more desirable from the > runtime view (multiple consumers at the branch point). > > On Mon, Jan 19, 2015 at 10:59 AM, Robert Metzger <rmetz...@apache.org> > wrote: > > > I would also execute the sinks immediately. I think its a corner case > > because the sinks are usually the last thing in a plan and all print() or > > collect() statements are earlier in the plan. > > > > print() should go to the client command line, yes. > > > > On Mon, Jan 19, 2015 at 1:42 AM, Stephan Ewen <se...@apache.org> wrote: > > > > > Hi there! > > > > > > With the upcoming more interactive extensions to the API (operations > that > > > go back to the client from a program and need to be eagerly evaluated) > we > > > need to define how different actions should behave. > > > > > > Currently, nothing gets executed until the "env.execute()" call is > made. > > > That allows to produce multiple data sources at the same time, which > is a > > > good feature. > > > > > > For certain operations, like the "count()" and "collect()" functions > > added > > > in https://github.com/apache/flink/pull/210 , we need to trigger > > execution > > > immediately. > > > > > > The open question is, how should this behave in connection with already > > > defined data sinks: > > > > > > 1) Should all yet defined data sinks be executed as well? > > > 2) Should only that immediate operation be executed and the data sinks > be > > > pending till a call to "env.execute()" > > > > > > I am somewhat leaning towards the first option right now, because I > think > > > that executing them later may force re-execution of larger parts of the > > > plan. > > > > > > In addition: I think that the "print()" commands should go to the > client > > > command line. In that sense, they would behave like > > > "collect().foreach(print)" > > > > > > > > > Greetings, > > > Stephan > > > > > >