Hi Shaoxuan,

Re 2:
> Table t3 = methodThatAppliesOperators(t1) // t1 is modified to-> t1'

What do you mean by "t1 is modified to -> t1'"? That the `methodThatAppliesOperators()` method has changed its plan?

I was thinking more about something like this:

Table source = … // some source that scans files from a directory "/foo/bar/"
Table t1 = source.groupBy(…).select(…).where(…) ….;
Table t2 = t1.materialize() // (or `cache()`)

t2.count() // initialise the cache (if it's lazily initialised)

int a1 = t1.count()
int b1 = t2.count()

// something in the background (or we trigger it) writes new files to /foo/bar

int a2 = t1.count()
int b2 = t2.count()

t2.refresh() // possible future extension, not to be implemented in the initial version

int a3 = t1.count()
int b3 = t2.count()

t2.drop() // another possible future extension, manual "cache" dropping

assertTrue(a1 == b1) // same results, but b1 comes from the "cache"
assertTrue(b1 == b2) // both values come from the same cache
assertTrue(a2 > b2)  // b2 comes from the cache, a2 re-executed a full table scan and has more data
assertTrue(b3 > b2)  // b3 comes from the refreshed cache
assertTrue(b3 == a2 && a2 == a3)

Piotrek

> On 30 Nov 2018, at 10:22, Jark Wu <imj...@gmail.com> wrote:
>
> Hi,
>
> It is a very interesting and useful design!
>
> Here I want to share some of my thoughts:
>
> 1. Agree that the cache() method should return some Table to avoid some unexpected problems because of the mutable object. All the existing methods of Table return a new Table instance.
>
> 2. I think materialize() would be more consistent with SQL; this makes it possible to support the same feature for SQL (materialized view) and keep the same API for users in the future. But I'm also fine if we choose cache().
>
> 3. In the proposal, a TableService (or FlinkService?) is used to cache the result of the (intermediate) table. But the name TableService may be a bit general and not easy to understand correctly at first glance (a metastore for tables?). Maybe a more specific name would be better, such as TableCacheService or TableMaterializeService or something else.
>
> Best,
> Jark
>
> On Thu, 29 Nov 2018 at 21:16, Fabian Hueske <fhue...@gmail.com> wrote:
>
>> Hi,
>>
>> Thanks for the clarification Becket!
>>
>> I have a few thoughts to share / questions:
>>
>> 1) I'd like to know how you plan to implement the feature on a plan / planner level.
>>
>> I would imagine the following to happen when Table.cache() is called:
>>
>> 1) immediately optimize the Table and internally convert it into a DataSet/DataStream. This is necessary to prevent operators of later queries on top of the Table from being pushed down.
>> 2) register the DataSet/DataStream as a DataSet/DataStream-backed Table X
>> 3) add a sink to the DataSet/DataStream. This is the materialization of the Table X.
>>
>> Based on your proposal, the following would happen:
>>
>> Table t1 = ....
>> t1.cache(); // cache() returns void. The logical plan of t1 is replaced by a scan of X. There is also a reference to the materialization of X.
>>
>> t1.count(); // this executes the program, including the DataSet/DataStream that backs X and the sink that writes the materialization of X
>> t1.count(); // this executes the program, but reads X from the materialization
>>
>> My question is, how do you determine whether the scan of t1 should go against the DataSet/DataStream program and when against the materialization?
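For illustration only, the three steps Fabian sketches above could look roughly like the pseudo-code below; this assumes the batch case, and the environment accessor and the temporary-table sink are hypothetical names, not part of the proposal:

// rough sketch of the imagined cache() flow; names are assumptions
public Table cache() {
    // 1) optimize this Table now and translate it into a DataSet, so operators of later queries are not pushed below the cache boundary
    DataSet<Row> program = batchTableEnv.toDataSet(this, Row.class);
    // 2) re-register the result as a DataSet-backed Table X
    Table x = batchTableEnv.fromDataSet(program);
    // 3) add a sink that persists the data; this is the materialization of X
    program.output(new TemporaryTableOutputFormat()); // hypothetical sink
    return x;
}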
>> AFAIK, there is no hook that will tell you that a part of the program was executed. Flipping a switch during optimization or plan generation is not sufficient as there is no guarantee that the plan is also executed.
>>
>> Overall, this behavior is somewhat similar to what I proposed in FLINK-8950, which does not include persisting the table, but just optimizing and re-registering it as a DataSet/DataStream scan.
>>
>> 2) I think Piotr has a point about the implicit behavior and side effects of the cache() method if it does not return anything.
>> Consider the following example:
>>
>> Table t1 = ???
>> Table t2 = methodThatAppliesOperators(t1);
>> Table t3 = methodThatAppliesOtherOperators(t1);
>>
>> In this case, the behavior/performance of the plan that results from the second method call depends on whether t1 was modified by the first method or not.
>> This is the classic issue of mutable vs. immutable objects.
>> Also, as Piotr pointed out, it might be good to have the original plan of t1, because in some cases it is possible to push filters down such that evaluating the query from scratch might be more efficient than accessing the cache.
>> Moreover, a CachedTable could extend Table and offer a method refresh(). This sounds quite useful in an interactive session mode.
>>
>> 3) Regarding the name, I can see both arguments. IMO, materialize() seems to be more future proof.
>>
>> Best, Fabian
>>
>> On Thu, 29 Nov 2018 at 12:56, Shaoxuan Wang <wshaox...@gmail.com> wrote:
>>
>>> Hi Piotr,
>>>
>>> Thanks for sharing your ideas on the method naming. We will think about your suggestions. But I don't understand why we need to change the return type of cache().
>>>
>>> cache() is a physical operation; it does not change the logic of the `Table`. On the Table API layer, we should not introduce a new table type unless the logic of the table has been changed. If we introduce a new table type `CachedTable`, we would need to create the same set of methods of `Table` for it. I don't think it is worth doing this. Or can you please elaborate more on what could be the "implicit behaviours/side effects" you are thinking about?
>>>
>>> Regards,
>>> Shaoxuan
>>>
>>> On Thu, Nov 29, 2018 at 7:05 PM Piotr Nowojski <pi...@data-artisans.com> wrote:
>>>
>>>> Hi Becket,
>>>>
>>>> Thanks for the response.
>>>>
>>>> 1. I wasn't saying that a materialised view must be mutable or not. The same thing applies to caches as well. To the contrary, I would expect more consistency and updates from something that is called a "cache" vs something that's a "materialised view". In other words, IMO most caches do not serve you invalid/outdated data and they handle updates on their own.
>>>>
>>>> 2. I don't think that having in the future two very similar concepts of `materialized` view and `cache` is a good idea. It would be confusing for the users. I think it could be handled by variations/overloading of the materialised view concept.
>>>> We could start with:
>>>>
>>>> `MaterializedTable materialize()` - immutable, session life scope (basically the same semantics as you are proposing)
>>>>
>>>> And then in the future (if ever) build on top of that / expand it with:
>>>>
>>>> `MaterializedTable materialize(refreshTime=…)` or `MaterializedTable materialize(refreshHook=…)`
>>>>
>>>> Or with cross-session support:
>>>>
>>>> `MaterializedTable materializeInto(connector=…)` or `MaterializedTable materializeInto(tableFactory=…)`
>>>>
>>>> I'm not saying that we should implement cross-session support/refreshing now or even in the near future. I'm just arguing that naming the current immutable, session-scoped method `materialize()` is more future proof and more consistent with SQL (on which, after all, the Table API is heavily based).
>>>>
>>>> 3. Even if we agree on naming it `cache()`, I would still insist on `cache()` returning a `CachedTable` handle to avoid implicit behaviours/side effects and to give both us & users more flexibility.
>>>>
>>>> Piotrek
>>>>
>>>>> On 29 Nov 2018, at 06:20, Becket Qin <becket....@gmail.com> wrote:
>>>>>
>>>>> Just to add a little bit, the materialized view is probably more similar to the persist() brought up earlier in the thread. So it is usually cross-session and could be used in a larger scope. For example, a materialized view created by user A may be visible to user B. It is probably something we want to have in the future. I'll put it in the future work section.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Jiangjie (Becket) Qin
>>>>>
>>>>> On Thu, Nov 29, 2018 at 9:47 AM Becket Qin <becket....@gmail.com> wrote:
>>>>>
>>>>>> Hi Piotrek,
>>>>>>
>>>>>> Thanks for the explanation.
>>>>>>
>>>>>> Right now we are mostly thinking of the cached table as immutable. I can see the materialized view would be useful in the future. That said, I think a simple cache mechanism is probably still needed. So to me, cache() and materialize() should be two separate methods as they address different needs. materialize() is a higher level concept usually implying periodical updates, while cache() has much simpler semantics. For example, one may create a materialized view and use the cache() method in the materialized view creation logic, so that during the materialized view update they do not need to worry about the case that the cached table is also changed. Maybe under the hood, materialize() and cache() could share some mechanism, but I think a simple cache() method would be handy in a lot of cases.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Jiangjie (Becket) Qin
>>>>>>
>>>>>> On Mon, Nov 26, 2018 at 9:38 PM Piotr Nowojski <pi...@data-artisans.com> wrote:
>>>>>>
>>>>>>> Hi Becket,
>>>>>>>
>>>>>>>> Is there any extra thing user can do on a MaterializedTable that they cannot do on a Table?
>>>>>>>
>>>>>>> Maybe not in the initial implementation, but various DBs offer different ways to "refresh" the materialised view: hooks, triggers, timers, manually etc. Having `MaterializedTable` would help us to handle that in the future.
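To make the handle idea concrete, a minimal sketch of what such a `MaterializedTable` handle could expose; the method set below is an assumption for illustration, not something the proposal specifies:

// hypothetical handle returned by materialize(); the methods are assumptions
public interface MaterializedTable extends Table {

    // re-run the backing job and replace the materialized data (manual refresh; hooks/timers could be added later)
    void refresh();

    // drop the materialized data and release the temporary storage
    void drop();
}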
>>>>>>>> After users call *table.cache()*, users can just use that table and do anything that is supported on a Table, including SQL.
>>>>>>>
>>>>>>> This is some implicit behaviour with side effects. Imagine a user has a long and complicated program that touches table `b` multiple times, maybe scattered around different methods. If he modifies his program by inserting in one place
>>>>>>>
>>>>>>> b.cache()
>>>>>>>
>>>>>>> this implicitly alters the semantics and behaviour of his code all over the place, maybe in ways that might cause problems. For example, what if the underlying data is changing?
>>>>>>>
>>>>>>> Having invisible side effects is also not very clean. For example, think about something like this (but more complicated):
>>>>>>>
>>>>>>> Table b = ...;
>>>>>>>
>>>>>>> if (some_condition) {
>>>>>>>   processTable1(b)
>>>>>>> }
>>>>>>> else {
>>>>>>>   processTable2(b)
>>>>>>> }
>>>>>>>
>>>>>>> // do more stuff with b
>>>>>>>
>>>>>>> And the user adds a `b.cache()` call to only one of the `processTable1` or `processTable2` methods.
>>>>>>>
>>>>>>> On the other hand
>>>>>>>
>>>>>>> Table materialisedB = b.materialize()
>>>>>>>
>>>>>>> avoids (at least some of) the side effect issues, forces the user to explicitly use `materialisedB` where it's appropriate, and forces the user to think about what it actually means. And if something doesn't work in the end for the user, he will know what he has changed instead of blaming Flink for some "magic" underneath. In the above example, after materialising b in only one of the methods, he should/would realise the issue when handling the `MaterializedTable` return value of that method.
>>>>>>>
>>>>>>> I guess it comes down to personal preferences whether you like things to be implicit or not. The more of a power user someone is, the more likely he probably is to like/understand implicit behaviour. And we as Table API designers are the most power users out there, so I would proceed with caution (so that we do not end up in the crazy Perl realm with its lovely implicit method arguments ;) <https://stackoverflow.com/a/14922656/8149051>)
>>>>>>>
>>>>>>>> Table API to also support non-relational processing cases, cache() might be slightly better.
>>>>>>>
>>>>>>> I think even such an extended Table API could benefit from sticking to/being consistent with SQL, where both SQL and the Table API are basically the same.
>>>>>>>
>>>>>>> One more thing: `MaterializedTable materialize()` could be more powerful/flexible, allowing the user to operate both on the materialised and the not materialised view at the same time for whatever reasons (underlying data changing / better optimisation opportunities after pushing down more filters etc). For example:
>>>>>>>
>>>>>>> Table b = …;
>>>>>>>
>>>>>>> MaterializedTable mb = b.materialize();
>>>>>>>
>>>>>>> val min = mb.min();
>>>>>>> val max = mb.max();
>>>>>>>
>>>>>>> val user42 = b.filter('userId = 42);
>>>>>>>
>>>>>>> could be more efficient compared to `b.cache()` if `filter('userId = 42)` allows for much more aggressive optimisations.
>>>>>>>
>>>>>>> Piotrek
>>>>>>>
>>>>>>>> On 26 Nov 2018, at 12:14, Fabian Hueske <fhue...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> I'm not suggesting to add support for Ignite. This was just an example. Plasma and Arrow sound interesting, too.
>>>>>>>> For the sake of this proposal, it would be up to the user to implement a TableFactory and corresponding TableSource / TableSink classes to persist and read the data.
>>>>>>>>
>>>>>>>> On Mon, 26 Nov 2018 at 12:06, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>>>>>>>
>>>>>>>>> What about also adding Apache Plasma + Arrow as an alternative to Apache Ignite?
>>>>>>>>> [1] https://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/
>>>>>>>>>
>>>>>>>>> On Mon, Nov 26, 2018 at 11:56 AM Fabian Hueske <fhue...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Thanks for the proposal!
>>>>>>>>>>
>>>>>>>>>> To summarize, you propose a new method Table.cache(): Table that will trigger a job and write the result into some temporary storage as defined by a TableFactory.
>>>>>>>>>> The cache() call blocks while the job is running and eventually returns a Table object that represents a scan of the temporary table.
>>>>>>>>>> When the "session" is closed (closing to be defined?), the temporary tables are all dropped.
>>>>>>>>>>
>>>>>>>>>> I think this behavior makes sense and is a good first step towards more interactive workloads.
>>>>>>>>>> However, its performance suffers from writing to and reading from external systems.
>>>>>>>>>> I think this is OK for now. Changes that would significantly improve the situation (i.e., pinning data in-memory across jobs) would have large impacts on many components of Flink.
>>>>>>>>>> Users could use in-memory filesystems or storage grids (Apache Ignite) to mitigate some of the performance effects.
>>>>>>>>>>
>>>>>>>>>> Best, Fabian
>>>>>>>>>>
>>>>>>>>>> On Mon, 26 Nov 2018 at 03:38, Becket Qin <becket....@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks for the explanation, Piotrek.
>>>>>>>>>>>
>>>>>>>>>>> Is there any extra thing a user can do on a MaterializedTable that they cannot do on a Table? After users call *table.cache()*, users can just use that table and do anything that is supported on a Table, including SQL.
>>>>>>>>>>>
>>>>>>>>>>> Naming wise, either cache() or materialize() sounds fine to me. cache() is a bit more general than materialize(). Given that we are enhancing the Table API to also support non-relational processing cases, cache() might be slightly better.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Jiangjie (Becket) Qin
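As a usage-level illustration of the blocking cache()-then-scan behaviour summarized by Fabian above, a rough sketch; the table names, the environment setup and the expression strings are assumptions:

// assumes a TableEnvironment `tEnv` with a registered "Orders" table
Table totals = tEnv.scan("Orders")
    .groupBy("userId")
    .select("userId, amount.sum as total");

Table cached = totals.cache();              // blocks: runs a job and writes the result to temporary storage
Table big = cached.filter("total > 100");   // later operations read the temporary table instead of recomputing

tEnv.registerTable("CachedTotals", cached); // SQL over the cached result works as well
Table viaSql = tEnv.sqlQuery("SELECT userId, total FROM CachedTotals WHERE total > 100");
// when the session is closed, the temporary table backing `cached` is dropped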
>>>>>>>>>>> On Fri, Nov 23, 2018 at 11:25 PM Piotr Nowojski <pi...@data-artisans.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Becket,
>>>>>>>>>>>>
>>>>>>>>>>>> Oops, sorry, I didn't notice that you intend to reuse the existing `TableFactory`. I don't know why, but I assumed that you want to provide an alternate way of writing the data.
>>>>>>>>>>>>
>>>>>>>>>>>> Now that I hopefully understand the proposal, maybe we could rename `cache()` to
>>>>>>>>>>>>
>>>>>>>>>>>> void materialize()
>>>>>>>>>>>>
>>>>>>>>>>>> or, going a step further,
>>>>>>>>>>>>
>>>>>>>>>>>> MaterializedTable materialize()
>>>>>>>>>>>> MaterializedTable createMaterializedView()
>>>>>>>>>>>>
>>>>>>>>>>>> ?
>>>>>>>>>>>>
>>>>>>>>>>>> The second option, returning a handle, I think is more flexible and could provide features such as "refresh"/"delete" or, generally speaking, managing the view. In the future we could also think about adding hooks to automatically refresh the view etc. It is also more explicit - materialization returning a new table handle will not have the same implicit side effects as adding a simple line of code like `b.cache()` would have.
>>>>>>>>>>>>
>>>>>>>>>>>> It would also be more SQL like, making it more intuitive for users already familiar with SQL.
>>>>>>>>>>>>
>>>>>>>>>>>> Piotrek
>>>>>>>>>>>>
>>>>>>>>>>>>> On 23 Nov 2018, at 14:53, Becket Qin <becket....@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Piotrek,
>>>>>>>>>>>>>
>>>>>>>>>>>>> For the cache() method itself, yes, it is equivalent to creating a BUILT-IN materialized view with a lifecycle. That functionality is missing today, though. Not sure if I understand your question. Do you mean we already have the functionality and just need some syntactic sugar?
>>>>>>>>>>>>>
>>>>>>>>>>>>> What's more interesting in the proposal is: do we want to stop at creating the materialized view? Or do we want to extend that in the future to a more useful unified data store distributed with Flink? And do we want to have a mechanism to allow more flexible user job patterns with their own user defined services? These considerations are much more architectural.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 6:01 PM Piotr Nowojski <pi...@data-artisans.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Interesting idea. I'm trying to understand the problem. Isn't the `cache()` call an equivalent of writing data to a sink and later reading from it? Where this sink has a limited live scope/live time? And the sink could be implemented as an in-memory or a file sink?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If so, what's the problem with creating a materialised view from a table "b" (from your document's example) and reusing this materialised view later? Maybe we are lacking mechanisms to clean up materialised views (for example when the current session finishes)? Maybe we need some syntactic sugar on top of it?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Piotrek
>>>>>>>>>>>>>>> On 23 Nov 2018, at 07:21, Becket Qin <becket....@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for the suggestion, Jincheng.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, I think it makes sense to have a persist() with a lifecycle/defined scope. I just added a section in the future work for this.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 1:55 PM jincheng sun <sunjincheng...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Jiangjie,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thank you for the explanation about the name of `cache()`, I understand why you designed it this way!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Another idea is whether we can specify a lifecycle for data persistence? For example, persist(LifeCycle.SESSION), so that the user is not worried about data loss and the time range for keeping the data is clearly specified. At the same time, if we want to expand, we could also share within a certain group of sessions, for example: LifeCycle.SESSION_GROUP(...). I am not sure, just an immature suggestion, for reference only!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Bests,
>>>>>>>>>>>>>>>> Jincheng
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, 23 Nov 2018 at 1:33 PM, Becket Qin <becket....@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Re: Jincheng,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks for the feedback. Regarding cache() vs. persist(), personally I find cache() to describe the behavior more accurately, i.e. the Table is cached for the session, but will be deleted after the session is closed. persist() seems a little misleading as people might think the table will still be there even after the session is gone.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Great point about mixing batch and stream processing in the same job. We should absolutely move towards that goal. I imagine that would be a huge change across the board, including sources, operators and optimizations, to name some. Likely we will need several separate in-depth discussions.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin
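For reference, jincheng's lifecycle suggestion from above might look roughly like the sketch below; the enum and the persist() signature are hypothetical, not part of the proposal:

// hypothetical API for scoping persisted data
public enum LifeCycle {
    SESSION  // data is dropped when the creating session closes
    // a SESSION_GROUP variant could later allow sharing across a group of sessions
}

Table persisted = t.persist(LifeCycle.SESSION);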
>>>>>>>>>>>>>>>>> On Fri, Nov 23, 2018 at 5:14 AM Xingcan Cui <xingc...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> @Shaoxuan, I think the lifecycle or access domain are both orthogonal to the cache problem. Essentially, this may be the first time we plan to introduce another storage mechanism other than the state. Maybe it's better to first draw a big picture and then concentrate on a specific part?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> @Becket, yes, actually I am more concerned with the underlying service. This seems to be quite a major change to the existing codebase. As you claimed, the service should be extensible to support other components, and we'd better discuss it in another thread.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> All in all, I am also eager to enjoy the more interactive Table API, given a general and flexible enough service mechanism.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>> Xingcan
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Nov 22, 2018, at 10:16 AM, Xiaowei Jiang <xiaow...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Relying on a callback for the temp table for clean up is not very reliable.
>>>>>>>>>>>>>>>>>>> There is no guarantee that it will be executed successfully. We may risk leaks when that happens. I think that it's safer to have an association between a temp table and a session id, so we can always clean up temp tables which are no longer associated with any active sessions.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>> Xiaowei
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, Nov 22, 2018 at 12:55 PM jincheng sun <sunjincheng...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Jiangjie & Shaoxuan,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks for initiating this great proposal!
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Interactive Programming is very useful and user friendly in the case of your examples.
>>>>>>>>>>>>>>>>>>>> Moreover, especially when a business has to be executed in several stages with dependencies, such as the pipeline of Flink ML, in order to utilize the intermediate calculation results we have to submit a job by env.execute().
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> About `cache()`, I think it is better to name it `persist()`, and let the Flink framework determine whether we internally cache in memory or persist to the storage system. Maybe save the data into a state backend (MemoryStateBackend or RocksDBStateBackend etc.).
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> BTW, from my point of view, in the future, support for streaming and batch mode switching in the same job will also benefit "Interactive Programming". I am looking forward to your JIRAs and FLIP!
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>> Jincheng
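Xiaowei's earlier point about associating temporary tables with a session id could be sketched roughly as below; the registry and the drop call are hypothetical, shown only to illustrate the cleanup idea:

import java.util.*;

// hypothetical bookkeeping on the service side: session id -> temp table names
Map<String, Set<String>> tempTablesBySession = new HashMap<>();

void onTableCached(String sessionId, String tempTableName) {
    tempTablesBySession.computeIfAbsent(sessionId, s -> new HashSet<>()).add(tempTableName);
}

// run periodically: drop temp tables whose session is no longer active
void cleanUp(Set<String> activeSessionIds) {
    tempTablesBySession.entrySet().removeIf(entry -> {
        if (!activeSessionIds.contains(entry.getKey())) {
            entry.getValue().forEach(this::dropTempTable); // hypothetical drop call
            return true;
        }
        return false;
    });
}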
>>>>>>>>>>>>>>>>>>>> On Tue, 20 Nov 2018 at 9:56 PM, Becket Qin <becket....@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> As a few recent email threads have pointed out, it is a promising opportunity to enhance the Flink Table API in various aspects, including functionality and ease of use among others. One of the scenarios where we feel Flink could improve is interactive programming. To explain the issues and facilitate the discussion on the solution, we put together the following document with our proposal.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1d4T2zTyfe7hdncEUAxrlNOYr4e5IMNEZLyqSuuswkA0/edit?usp=sharing
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Feedback and comments are very welcome!
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Jiangjie (Becket) Qin