Hi,

Thanks for the clarification, Becket!

I have a few thoughts to share / questions:

1) I'd like to know how you plan to implement the feature on a plan / planner level. I would imagine the following to happen when Table.cache() is called:
  a) immediately optimize the Table and internally convert it into a DataSet/DataStream. This is necessary to prevent operators of later queries on top of the Table from being pushed down.
  b) register the DataSet/DataStream as a DataSet/DataStream-backed Table X.
  c) add a sink to the DataSet/DataStream. This is the materialization of the Table X.

Based on your proposal the following would happen:

Table t1 = ....
t1.cache();  // cache() returns void. The logical plan of t1 is replaced by a scan of X. There is also a reference to the materialization of X.
t1.count();  // this executes the program, including the DataSet/DataStream that backs X and the sink that writes the materialization of X
t1.count();  // this executes the program, but reads X from the materialization.

My question is: how do you determine when the scan of t1 should go against the DataSet/DataStream program and when against the materialization? AFAIK, there is no hook that will tell you that a part of the program was executed. Flipping a switch during optimization or plan generation is not sufficient, as there is no guarantee that the plan is also executed.
Overall, this behavior is somewhat similar to what I proposed in FLINK-8950, which does not include persisting the table, but just optimizing and re-registering it as a DataSet/DataStream scan. (A rough sketch of these sink-plus-scan mechanics is appended at the very end of this thread.)

2) I think Piotr has a point about the implicit behavior and side effects of the cache() method if it does not return anything. Consider the following example:

Table t1 = ???
Table t2 = methodThatAppliesOperators(t1);
Table t3 = methodThatAppliesOtherOperators(t1);

In this case, the behavior/performance of the plan that results from the second method call depends on whether t1 was modified by the first method or not. This is the classic issue of mutable vs. immutable objects. Also, as Piotr pointed out, it might be good to keep the original plan of t1, because in some cases it is possible to push filters down such that evaluating the query from scratch is more efficient than accessing the cache. Moreover, a CachedTable could extend Table and offer a refresh() method. This sounds quite useful in an interactive session mode. (An API sketch contrasting the two styles is also appended at the end of this thread.)

3) Regarding the name, I can see both arguments. IMO, materialize() seems to be more future-proof.

Best,
Fabian

On Thu, Nov 29, 2018 at 12:56 PM Shaoxuan Wang <wshaox...@gmail.com> wrote:
> Hi Piotr, > > Thanks for sharing your ideas on the method naming. We will think about > your suggestions. But I don't understand why we need to change the return > type of cache(). > > Cache() is a physical operation, it does not change the logic of > the `Table`. On the tableAPI layer, we should not introduce a new table > type unless the logic of table has been changed. If we introduce a new > table type `CachedTable`, we need create the same set of methods of `Table` > for it. I don't think it is worth doing this. Or can you please elaborate > more on what could be the "implicit behaviours/side effects" you are > thinking about? > > Regards, > Shaoxuan > > > > On Thu, Nov 29, 2018 at 7:05 PM Piotr Nowojski <pi...@data-artisans.com> > wrote: > > > Hi Becket, > > > > Thanks for the response. > > > > 1. I wasn’t saying that materialised view must be mutable or not. The > same > > thing applies to caches as well. 
To the contrary, I would expect more > > consistency and updates from something that is called “cache” vs > something > > that’s a “materialised view”. In other words, IMO most caches do not > serve > > you invalid/outdated data and they handle updates on their own. > > > > 2. I don’t think that having in the future two very similar concepts of > > `materialized` view and `cache` is a good idea. It would be confusing for > > the users. I think it could be handled by variations/overloading of > > materialised view concept. We could start with: > > > > `MaterializedTable materialize()` - immutable, session life scope > > (basically the same semantic as you are proposing > > > > And then in the future (if ever) build on top of that/expand it with: > > > > `MaterializedTable materialize(refreshTime=…)` or `MaterializedTable > > materialize(refreshHook=…)` > > > > Or with cross session support: > > > > `MaterializedTable materializeInto(connector=…)` or `MaterializedTable > > materializeInto(tableFactory=…)` > > > > I’m not saying that we should implement cross session/refreshing now or > > even in the near future. I’m just arguing that naming current immutable > > session life scope method `materialize()` is more future proof and more > > consistent with SQL (on which after all table-api is heavily basing on). > > > > 3. Even if we agree on naming it `cache()`, I would still insist on > > `cache()` returning `CachedTable` handle to avoid implicit > behaviours/side > > effects and to give both us & users more flexibility. > > > > Piotrek > > > > > On 29 Nov 2018, at 06:20, Becket Qin <becket....@gmail.com> wrote: > > > > > > Just to add a little bit, the materialized view is probably more > similar > > to > > > the persistent() brought up earlier in the thread. So it is usually > cross > > > session and could be used in a larger scope. For example, a > materialized > > > view created by user A may be visible to user B. It is probably > something > > > we want to have in the future. I'll put it in the future work section. > > > > > > Thanks, > > > > > > Jiangjie (Becket) Qin > > > > > > On Thu, Nov 29, 2018 at 9:47 AM Becket Qin <becket....@gmail.com> > wrote: > > > > > >> Hi Piotrek, > > >> > > >> Thanks for the explanation. > > >> > > >> Right now we are mostly thinking of the cached table as immutable. I > can > > >> see the Materialized view would be useful in the future. That said, I > > think > > >> a simple cache mechanism is probably still needed. So to me, cache() > and > > >> materialize() should be two separate method as they address different > > >> needs. Materialize() is a higher level concept usually implying > > periodical > > >> update, while cache() has much simpler semantic. For example, one may > > >> create a materialized view and use cache() method in the materialized > > view > > >> creation logic. So that during the materialized view update, they do > not > > >> need to worry about the case that the cached table is also changed. > > Maybe > > >> under the hood, materialized() and cache() could share some mechanism, > > but > > >> I think a simple cache() method would be handy in a lot of cases. > > >> > > >> Thanks, > > >> > > >> Jiangjie (Becket) Qin > > >> > > >> On Mon, Nov 26, 2018 at 9:38 PM Piotr Nowojski < > pi...@data-artisans.com > > > > > >> wrote: > > >> > > >>> Hi Becket, > > >>> > > >>>> Is there any extra thing user can do on a MaterializedTable that > they > > >>> cannot do on a Table? 
> > >>> > > >>> Maybe not in the initial implementation, but various DBs offer > > different > > >>> ways to “refresh” the materialised view. Hooks, triggers, timers, > > manually > > >>> etc. Having `MaterializedTable` would help us to handle that in the > > future. > > >>> > > >>>> After users call *table.cache(), *users can just use that table and > do > > >>> anything that is supported on a Table, including SQL. > > >>> > > >>> This is some implicit behaviour with side effects. Imagine if user > has > > a > > >>> long and complicated program, that touches table `b` multiple times, > > maybe > > >>> scattered around different methods. If he modifies his program by > > inserting > > >>> in one place > > >>> > > >>> b.cache() > > >>> > > >>> This implicitly alters the semantic and behaviour of his code all > over > > >>> the place, maybe in a ways that might cause problems. For example > what > > if > > >>> underlying data is changing? > > >>> > > >>> Having invisible side effects is also not very clean, for example > think > > >>> about something like this (but more complicated): > > >>> > > >>> Table b = ...; > > >>> > > >>> If (some_condition) { > > >>> processTable1(b) > > >>> } > > >>> else { > > >>> processTable2(b) > > >>> } > > >>> > > >>> // do more stuff with b > > >>> > > >>> And user adds `b.cache()` call to only one of the `processTable1` or > > >>> `processTable2` methods. > > >>> > > >>> On the other hand > > >>> > > >>> Table materialisedB = b.materialize() > > >>> > > >>> Avoids (at least some of) the side effect issues and forces user to > > >>> explicitly use `materialisedB` where it’s appropriate and forces user > > to > > >>> think what does it actually mean. And if something doesn’t work in > the > > end > > >>> for the user, he will know what has he changed instead of blaming > > Flink for > > >>> some “magic” underneath. In the above example, after materialising b > in > > >>> only one of the methods, he should/would realise about the issue when > > >>> handling the return value `MaterializedTable` of that method. > > >>> > > >>> I guess it comes down to personal preferences if you like things to > be > > >>> implicit or not. The more power is the user, probably the more likely > > he is > > >>> to like/understand implicit behaviour. And we as Table API designers > > are > > >>> the most power users out there, so I would proceed with caution (so > > that we > > >>> do not end up in the crazy perl realm with it’s lovely implicit > method > > >>> arguments ;) <https://stackoverflow.com/a/14922656/8149051>) > > >>> > > >>>> Table API to also support non-relational processing cases, cache() > > >>> might be slightly better. > > >>> > > >>> I think even such extended Table API could benefit from sticking > > to/being > > >>> consistent with SQL where both SQL and Table API are basically the > > same. > > >>> > > >>> One more thing. `MaterializedTable materialize()` could be more > > >>> powerful/flexible allowing the user to operate both on materialised > > and not > > >>> materialised view at the same time for whatever reasons (underlying > > data > > >>> changing/better optimisation opportunities after pushing down more > > filters > > >>> etc). 
For example: > > >>> > > >>> Table b = …; > > >>> > > >>> MaterlizedTable mb = b.materialize(); > > >>> > > >>> Val min = mb.min(); > > >>> Val max = mb.max(); > > >>> > > >>> Val user42 = b.filter(‘userId = 42); > > >>> > > >>> Could be more efficient compared to `b.cache()` if `filter(‘userId = > > >>> 42);` allows for much more aggressive optimisations. > > >>> > > >>> Piotrek > > >>> > > >>>> On 26 Nov 2018, at 12:14, Fabian Hueske <fhue...@gmail.com> wrote: > > >>>> > > >>>> I'm not suggesting to add support for Ignite. This was just an > > example. > > >>>> Plasma and Arrow sound interesting, too. > > >>>> For the sake of this proposal, it would be up to the user to > > implement a > > >>>> TableFactory and corresponding TableSource / TableSink classes to > > >>> persist > > >>>> and read the data. > > >>>> > > >>>> Am Mo., 26. Nov. 2018 um 12:06 Uhr schrieb Flavio Pompermaier < > > >>>> pomperma...@okkam.it>: > > >>>> > > >>>>> What about to add also Apache Plasma + Arrow as an alternative to > > >>> Apache > > >>>>> Ignite? > > >>>>> [1] > > >>>>> > > >>> > > https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/ > > >>>>> > > >>>>> On Mon, Nov 26, 2018 at 11:56 AM Fabian Hueske <fhue...@gmail.com> > > >>> wrote: > > >>>>> > > >>>>>> Hi, > > >>>>>> > > >>>>>> Thanks for the proposal! > > >>>>>> > > >>>>>> To summarize, you propose a new method Table.cache(): Table that > > will > > >>>>>> trigger a job and write the result into some temporary storage as > > >>> defined > > >>>>>> by a TableFactory. > > >>>>>> The cache() call blocks while the job is running and eventually > > >>> returns a > > >>>>>> Table object that represents a scan of the temporary table. > > >>>>>> When the "session" is closed (closing to be defined?), the > temporary > > >>>>> tables > > >>>>>> are all dropped. > > >>>>>> > > >>>>>> I think this behavior makes sense and is a good first step towards > > >>> more > > >>>>>> interactive workloads. > > >>>>>> However, its performance suffers from writing to and reading from > > >>>>> external > > >>>>>> systems. > > >>>>>> I think this is OK for now. Changes that would significantly > improve > > >>> the > > >>>>>> situation (i.e., pinning data in-memory across jobs) would have > > large > > >>>>>> impacts on many components of Flink. > > >>>>>> Users could use in-memory filesystems or storage grids (Apache > > >>> Ignite) to > > >>>>>> mitigate some of the performance effects. > > >>>>>> > > >>>>>> Best, Fabian > > >>>>>> > > >>>>>> > > >>>>>> > > >>>>>> Am Mo., 26. Nov. 2018 um 03:38 Uhr schrieb Becket Qin < > > >>>>>> becket....@gmail.com > > >>>>>>> : > > >>>>>> > > >>>>>>> Thanks for the explanation, Piotrek. > > >>>>>>> > > >>>>>>> Is there any extra thing user can do on a MaterializedTable that > > they > > >>>>>>> cannot do on a Table? After users call *table.cache(), *users can > > >>> just > > >>>>>> use > > >>>>>>> that table and do anything that is supported on a Table, > including > > >>> SQL. > > >>>>>>> > > >>>>>>> Naming wise, either cache() or materialize() sounds fine to me. > > >>> cache() > > >>>>>> is > > >>>>>>> a bit more general than materialize(). Given that we are > enhancing > > >>> the > > >>>>>>> Table API to also support non-relational processing cases, > cache() > > >>>>> might > > >>>>>> be > > >>>>>>> slightly better. 
> > >>>>>>> > > >>>>>>> Thanks, > > >>>>>>> > > >>>>>>> Jiangjie (Becket) Qin > > >>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>>> On Fri, Nov 23, 2018 at 11:25 PM Piotr Nowojski < > > >>>>> pi...@data-artisans.com > > >>>>>>> > > >>>>>>> wrote: > > >>>>>>> > > >>>>>>>> Hi Becket, > > >>>>>>>> > > >>>>>>>> Ops, sorry I didn’t notice that you intend to reuse existing > > >>>>>>>> `TableFactory`. I don’t know why, but I assumed that you want to > > >>>>>> provide > > >>>>>>> an > > >>>>>>>> alternate way of writing the data. > > >>>>>>>> > > >>>>>>>> Now that I hopefully understand the proposal, maybe we could > > rename > > >>>>>>>> `cache()` to > > >>>>>>>> > > >>>>>>>> void materialize() > > >>>>>>>> > > >>>>>>>> or going step further > > >>>>>>>> > > >>>>>>>> MaterializedTable materialize() > > >>>>>>>> MaterializedTable createMaterializedView() > > >>>>>>>> > > >>>>>>>> ? > > >>>>>>>> > > >>>>>>>> The second option with returning a handle I think is more > flexible > > >>>>> and > > >>>>>>>> could provide features such as “refresh”/“delete” or generally > > >>>>> speaking > > >>>>>>>> manage the the view. In the future we could also think about > > adding > > >>>>>> hooks > > >>>>>>>> to automatically refresh view etc. It is also more explicit - > > >>>>>>>> materialization returning a new table handle will not have the > > same > > >>>>>>>> implicit side effects as adding a simple line of code like > > >>>>> `b.cache()` > > >>>>>>>> would have. > > >>>>>>>> > > >>>>>>>> It would also be more SQL like, making it more intuitive for > users > > >>>>>>> already > > >>>>>>>> familiar with the SQL. > > >>>>>>>> > > >>>>>>>> Piotrek > > >>>>>>>> > > >>>>>>>>> On 23 Nov 2018, at 14:53, Becket Qin <becket....@gmail.com> > > wrote: > > >>>>>>>>> > > >>>>>>>>> Hi Piotrek, > > >>>>>>>>> > > >>>>>>>>> For the cache() method itself, yes, it is equivalent to > creating > > a > > >>>>>>>> BUILT-IN > > >>>>>>>>> materialized view with a lifecycle. That functionality is > missing > > >>>>>>> today, > > >>>>>>>>> though. Not sure if I understand your question. Do you mean we > > >>>>>> already > > >>>>>>>> have > > >>>>>>>>> the functionality and just need a syntax sugar? > > >>>>>>>>> > > >>>>>>>>> What's more interesting in the proposal is do we want to stop > at > > >>>>>>> creating > > >>>>>>>>> the materialized view? Or do we want to extend that in the > future > > >>>>> to > > >>>>>> a > > >>>>>>>> more > > >>>>>>>>> useful unified data store distributed with Flink? And do we > want > > to > > >>>>>>> have > > >>>>>>>> a > > >>>>>>>>> mechanism allow more flexible user job pattern with their own > > user > > >>>>>>>> defined > > >>>>>>>>> services. These considerations are much more architectural. > > >>>>>>>>> > > >>>>>>>>> Thanks, > > >>>>>>>>> > > >>>>>>>>> Jiangjie (Becket) Qin > > >>>>>>>>> > > >>>>>>>>> On Fri, Nov 23, 2018 at 6:01 PM Piotr Nowojski < > > >>>>>>> pi...@data-artisans.com> > > >>>>>>>>> wrote: > > >>>>>>>>> > > >>>>>>>>>> Hi, > > >>>>>>>>>> > > >>>>>>>>>> Interesting idea. I’m trying to understand the problem. Isn’t > > the > > >>>>>>>>>> `cache()` call an equivalent of writing data to a sink and > later > > >>>>>>> reading > > >>>>>>>>>> from it? Where this sink has a limited live scope/live time? > And > > >>>>> the > > >>>>>>>> sink > > >>>>>>>>>> could be implemented as in memory or a file sink? 
> > >>>>>>>>>> > > >>>>>>>>>> If so, what’s the problem with creating a materialised view > > from a > > >>>>>>> table > > >>>>>>>>>> “b” (from your document’s example) and reusing this > materialised > > >>>>>> view > > >>>>>>>>>> later? Maybe we are lacking mechanisms to clean up > materialised > > >>>>>> views > > >>>>>>>> (for > > >>>>>>>>>> example when current session finishes)? Maybe we need some > > >>>>> syntactic > > >>>>>>>> sugar > > >>>>>>>>>> on top of it? > > >>>>>>>>>> > > >>>>>>>>>> Piotrek > > >>>>>>>>>> > > >>>>>>>>>>> On 23 Nov 2018, at 07:21, Becket Qin <becket....@gmail.com> > > >>>>> wrote: > > >>>>>>>>>>> > > >>>>>>>>>>> Thanks for the suggestion, Jincheng. > > >>>>>>>>>>> > > >>>>>>>>>>> Yes, I think it makes sense to have a persist() with > > >>>>>>> lifecycle/defined > > >>>>>>>>>>> scope. I just added a section in the future work for this. > > >>>>>>>>>>> > > >>>>>>>>>>> Thanks, > > >>>>>>>>>>> > > >>>>>>>>>>> Jiangjie (Becket) Qin > > >>>>>>>>>>> > > >>>>>>>>>>> On Fri, Nov 23, 2018 at 1:55 PM jincheng sun < > > >>>>>>> sunjincheng...@gmail.com > > >>>>>>>>> > > >>>>>>>>>>> wrote: > > >>>>>>>>>>> > > >>>>>>>>>>>> Hi Jiangjie, > > >>>>>>>>>>>> > > >>>>>>>>>>>> Thank you for the explanation about the name of `cache()`, I > > >>>>>>>> understand > > >>>>>>>>>> why > > >>>>>>>>>>>> you designed this way! > > >>>>>>>>>>>> > > >>>>>>>>>>>> Another idea is whether we can specify a lifecycle for data > > >>>>>>>> persistence? > > >>>>>>>>>>>> For example, persist (LifeCycle.SESSION), so that the user > is > > >>>>> not > > >>>>>>>>>> worried > > >>>>>>>>>>>> about data loss, and will clearly specify the time range for > > >>>>>> keeping > > >>>>>>>>>> time. > > >>>>>>>>>>>> At the same time, if we want to expand, we can also share > in a > > >>>>>>> certain > > >>>>>>>>>>>> group of session, for example: > LifeCycle.SESSION_GROUP(...), I > > >>>>> am > > >>>>>>> not > > >>>>>>>>>> sure, > > >>>>>>>>>>>> just an immature suggestion, for reference only! > > >>>>>>>>>>>> > > >>>>>>>>>>>> Bests, > > >>>>>>>>>>>> Jincheng > > >>>>>>>>>>>> > > >>>>>>>>>>>> Becket Qin <becket....@gmail.com> 于2018年11月23日周五 下午1:33写道: > > >>>>>>>>>>>> > > >>>>>>>>>>>>> Re: Jincheng, > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> Thanks for the feedback. Regarding cache() v.s. persist(), > > >>>>>>>> personally I > > >>>>>>>>>>>>> find cache() to be more accurately describing the behavior, > > >>>>> i.e. > > >>>>>>> the > > >>>>>>>>>>>> Table > > >>>>>>>>>>>>> is cached for the session, but will be deleted after the > > >>>>> session > > >>>>>> is > > >>>>>>>>>>>> closed. > > >>>>>>>>>>>>> persist() seems a little misleading as people might think > the > > >>>>>> table > > >>>>>>>>>> will > > >>>>>>>>>>>>> still be there even after the session is gone. > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> Great point about mixing the batch and stream processing in > > the > > >>>>>>> same > > >>>>>>>>>> job. > > >>>>>>>>>>>>> We should absolutely move towards that goal. I imagine that > > >>>>> would > > >>>>>>> be > > >>>>>>>> a > > >>>>>>>>>>>> huge > > >>>>>>>>>>>>> change across the board, including sources, operators and > > >>>>>>>>>> optimizations, > > >>>>>>>>>>>> to > > >>>>>>>>>>>>> name some. Likely we will need several separate in-depth > > >>>>>>> discussions. 
> > >>>>>>>>>>>>> > > >>>>>>>>>>>>> Thanks, > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> Jiangjie (Becket) Qin > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> On Fri, Nov 23, 2018 at 5:14 AM Xingcan Cui < > > >>>>> xingc...@gmail.com> > > >>>>>>>>>> wrote: > > >>>>>>>>>>>>> > > >>>>>>>>>>>>>> Hi all, > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> @Shaoxuan, I think the lifecycle or access domain are both > > >>>>>>>> orthogonal > > >>>>>>>>>>>> to > > >>>>>>>>>>>>>> the cache problem. Essentially, this may be the first time > > we > > >>>>>> plan > > >>>>>>>> to > > >>>>>>>>>>>>>> introduce another storage mechanism other than the state. > > >>>>> Maybe > > >>>>>>> it’s > > >>>>>>>>>>>>> better > > >>>>>>>>>>>>>> to first draw a big picture and then concentrate on a > > specific > > >>>>>>> part? > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> @Becket, yes, actually I am more concerned with the > > underlying > > >>>>>>>>>> service. > > >>>>>>>>>>>>>> This seems to be quite a major change to the existing > > >>>>> codebase. > > >>>>>> As > > >>>>>>>> you > > >>>>>>>>>>>>>> claimed, the service should be extendible to support other > > >>>>>>>> components > > >>>>>>>>>>>> and > > >>>>>>>>>>>>>> we’d better discussed it in another thread. > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> All in all, I also eager to enjoy the more interactive > Table > > >>>>>> API, > > >>>>>>> in > > >>>>>>>>>>>> case > > >>>>>>>>>>>>>> of a general and flexible enough service mechanism. > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> Best, > > >>>>>>>>>>>>>> Xingcan > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> On Nov 22, 2018, at 10:16 AM, Xiaowei Jiang < > > >>>>>> xiaow...@gmail.com> > > >>>>>>>>>>>>> wrote: > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> Relying on a callback for the temp table for clean up is > > not > > >>>>>> very > > >>>>>>>>>>>>>> reliable. > > >>>>>>>>>>>>>>> There is no guarantee that it will be executed > > successfully. > > >>>>> We > > >>>>>>> may > > >>>>>>>>>>>>> risk > > >>>>>>>>>>>>>>> leaks when that happens. I think that it's safer to have > an > > >>>>>>>>>>>> association > > >>>>>>>>>>>>>>> between temp table and session id. So we can always clean > > up > > >>>>>> temp > > >>>>>>>>>>>>> tables > > >>>>>>>>>>>>>>> which are no longer associated with any active sessions. > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> Regards, > > >>>>>>>>>>>>>>> Xiaowei > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>> On Thu, Nov 22, 2018 at 12:55 PM jincheng sun < > > >>>>>>>>>>>>> sunjincheng...@gmail.com> > > >>>>>>>>>>>>>>> wrote: > > >>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> Hi Jiangjie&Shaoxuan, > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> Thanks for initiating this great proposal! > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> Interactive Programming is very useful and user friendly > > in > > >>>>>> case > > >>>>>>>> of > > >>>>>>>>>>>>> your > > >>>>>>>>>>>>>>>> examples. > > >>>>>>>>>>>>>>>> Moreover, especially when a business has to be executed > in > > >>>>>>> several > > >>>>>>>>>>>>>> stages > > >>>>>>>>>>>>>>>> with dependencies,such as the pipeline of Flink ML, in > > order > > >>>>>> to > > >>>>>>>>>>>>> utilize > > >>>>>>>>>>>>>> the > > >>>>>>>>>>>>>>>> intermediate calculation results we have to submit a job > > by > > >>>>>>>>>>>>>> env.execute(). 
> > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> About the `cache()` , I think is better to named > > >>>>> `persist()`, > > >>>>>>> And > > >>>>>>>>>>>> The > > >>>>>>>>>>>>>>>> Flink framework determines whether we internally cache > in > > >>>>>> memory > > >>>>>>>> or > > >>>>>>>>>>>>>> persist > > >>>>>>>>>>>>>>>> to the storage system,Maybe save the data into state > > backend > > >>>>>>>>>>>>>>>> (MemoryStateBackend or RocksDBStateBackend etc.) > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> BTW, from the points of my view in the future, support > for > > >>>>>>>> streaming > > >>>>>>>>>>>>> and > > >>>>>>>>>>>>>>>> batch mode switching in the same job will also benefit > in > > >>>>>>>>>>>> "Interactive > > >>>>>>>>>>>>>>>> Programming", I am looking forward to your JIRAs and > > FLIP! > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> Best, > > >>>>>>>>>>>>>>>> Jincheng > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> Becket Qin <becket....@gmail.com> 于2018年11月20日周二 > > 下午9:56写道: > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> Hi all, > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> As a few recent email threads have pointed out, it is a > > >>>>>>> promising > > >>>>>>>>>>>>>>>>> opportunity to enhance Flink Table API in various > > aspects, > > >>>>>>>>>>>> including > > >>>>>>>>>>>>>>>>> functionality and ease of use among others. One of the > > >>>>>>> scenarios > > >>>>>>>>>>>>> where > > >>>>>>>>>>>>>> we > > >>>>>>>>>>>>>>>>> feel Flink could improve is interactive programming. To > > >>>>>> explain > > >>>>>>>> the > > >>>>>>>>>>>>>>>> issues > > >>>>>>>>>>>>>>>>> and facilitate the discussion on the solution, we put > > >>>>>> together > > >>>>>>>> the > > >>>>>>>>>>>>>>>>> following document with our proposal. > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>> > > > https://docs.google.com/document/d/1d4T2zTyfe7hdncEUAxrlNOYr4e5IMNEZLyqSuuswkA0/edit?usp=sharing > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> Feedback and comments are very welcome! > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> Thanks, > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin > > >>>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>>>> > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>>> > > >>>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>> > > >>> > > > > >
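
To make point 1) of the first mail above a bit more concrete: a rough manual equivalent of what a cache() call would presumably automate, following Fabian's summary of the proposal (write the table to temporary storage through a sink obtained from the configured TableFactory, then re-register a scan of that storage). This is only a sketch: createTempSink(), createTempSource(), the name "tmp_X", and the pre-existing tableEnv/env variables are placeholders, not part of the proposal.

    Table t1 = ...;                                     // some (expensive) query

    // 1) Materialization: write t1 to temporary storage and run the job.
    TableSink<Row> tmpSink = createTempSink();          // placeholder for a sink created by the configured TableFactory
    t1.writeToSink(tmpSink);
    env.execute();                                      // executes the plan of t1 plus the sink that persists the result

    // 2) Re-registration: later queries scan the stored result instead of re-computing t1.
    TableSource<Row> tmpSource = createTempSource();    // placeholder for a source reading what tmpSink wrote
    tableEnv.registerTableSource("tmp_X", tmpSource);
    Table cached = tableEnv.scan("tmp_X");
    cached.groupBy("user").select("user, amount.sum");  // runs against the materialized data

The open question raised in point 1) is exactly the hand-over between step 1 and step 2: nothing tells the planner whether the job that persists the result has actually been executed yet.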
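And for point 2), a minimal sketch contrasting the two API styles under discussion: a void cache() that modifies the table in place versus a handle-returning cache()/materialize(). CachedTable, MaterializedTable, and refresh() are hypothetical names taken from the discussion, not an agreed-upon API; the helper methods are the ones from Fabian's example.

    // Style 1: void cache() — t1 itself is modified as a side effect.
    Table t1 = tableEnv.scan("Orders").filter("amount > 10");
    t1.cache();                                       // side effect: subsequent uses of t1 are meant to read the cached data
    Table t2 = methodThatAppliesOperators(t1);        // implicitly reads the cache
    Table t3 = methodThatAppliesOtherOperators(t1);   // also reads the cache, even if re-optimizing the
                                                      // original plan (e.g. filter push-down) would be cheaper

    // Style 2: handle-returning cache()/materialize() — the original plan stays untouched.
    Table b = tableEnv.scan("Orders").filter("amount > 10");
    CachedTable cachedB = b.cache();                  // hypothetical; could equally be MaterializedTable b.materialize()
    Table t4 = methodThatAppliesOperators(cachedB);   // explicitly reads the cached data
    Table t5 = b.filter("userId === 42");             // still the original plan; the filter can be pushed to the source
    cachedB.refresh();                                // hypothetical extension: re-run the materialization

The second style keeps both the original plan and the materialized data addressable, which is essentially Piotr's argument for returning a handle.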