Re: [DISCUSS] Support Interactive Programming in Flink Table API

Piotr Nowojski Fri, 23 Nov 2018 07:32:50 -0800

Hi Becket,

Ops, sorry I didn’t notice that you intend to reuse existing `TableFactory`. I 
don’t know why, but I assumed that you want to provide an alternate way of 
writing the data.


Now that I hopefully understand the proposal, maybe we could rename `cache()` 
to 

void materialize()

or going step further

MaterializedTable materialize()
MaterializedTable createMaterializedView()

? 

The second option with returning a handle I think is more flexible and could 
provide features such as “refresh”/“delete” or generally speaking manage the 
the view. In the future we could also think about adding hooks to automatically 
refresh view etc. It is also more explicit - materialization returning a new 
table handle will not have the same implicit side effects as adding a simple 
line of code like `b.cache()` would have.

It would also be more SQL like, making it more intuitive for users already 
familiar with the SQL.

Piotrek

> On 23 Nov 2018, at 14:53, Becket Qin <becket....@gmail.com> wrote:
> 
> Hi Piotrek,
> 
> For the cache() method itself, yes, it is equivalent to creating a BUILT-IN
> materialized view with a lifecycle. That functionality is missing today,
> though. Not sure if I understand your question. Do you mean we already have
> the functionality and just need a syntax sugar?
> 
> What's more interesting in the proposal is do we want to stop at creating
> the materialized view? Or do we want to extend that in the future to a more
> useful unified data store distributed with Flink? And do we want to have a
> mechanism allow more flexible user job pattern with their own user defined
> services. These considerations are much more architectural.
> 
> Thanks,
> 
> Jiangjie (Becket) Qin
> 
> On Fri, Nov 23, 2018 at 6:01 PM Piotr Nowojski <pi...@data-artisans.com>
> wrote:
> 
>> Hi,
>> 
>> Interesting idea. I’m trying to understand the problem. Isn’t the
>> `cache()` call an equivalent of writing data to a sink and later reading
>> from it? Where this sink has a limited live scope/live time? And the sink
>> could be implemented as in memory or a file sink?
>> 
>> If so, what’s the problem with creating a materialised view from a table
>> “b” (from your document’s example) and reusing this materialised view
>> later? Maybe we are lacking mechanisms to clean up materialised views (for
>> example when current session finishes)? Maybe we need some syntactic sugar
>> on top of it?
>> 
>> Piotrek
>> 
>>> On 23 Nov 2018, at 07:21, Becket Qin <becket....@gmail.com> wrote:
>>> 
>>> Thanks for the suggestion, Jincheng.
>>> 
>>> Yes, I think it makes sense to have a persist() with lifecycle/defined
>>> scope. I just added a section in the future work for this.
>>> 
>>> Thanks,
>>> 
>>> Jiangjie (Becket) Qin
>>> 
>>> On Fri, Nov 23, 2018 at 1:55 PM jincheng sun <sunjincheng...@gmail.com>
>>> wrote:
>>> 
>>>> Hi Jiangjie,
>>>> 
>>>> Thank you for the explanation about the name of `cache()`, I understand
>> why
>>>> you designed this way!
>>>> 
>>>> Another idea is whether we can specify a lifecycle for data persistence?
>>>> For example, persist (LifeCycle.SESSION), so that the user is not
>> worried
>>>> about data loss, and will clearly specify the time range for keeping
>> time.
>>>> At the same time, if we want to expand, we can also share in a certain
>>>> group of session, for example: LifeCycle.SESSION_GROUP(...), I am not
>> sure,
>>>> just an immature suggestion, for reference only!
>>>> 
>>>> Bests,
>>>> Jincheng
>>>> 
>>>> Becket Qin <becket....@gmail.com> 于2018年11月23日周五 下午1:33写道：
>>>> 
>>>>> Re: Jincheng,
>>>>> 
>>>>> Thanks for the feedback. Regarding cache() v.s. persist(), personally I
>>>>> find cache() to be more accurately describing the behavior, i.e. the
>>>> Table
>>>>> is cached for the session, but will be deleted after the session is
>>>> closed.
>>>>> persist() seems a little misleading as people might think the table
>> will
>>>>> still be there even after the session is gone.
>>>>> 
>>>>> Great point about mixing the batch and stream processing in the same
>> job.
>>>>> We should absolutely move towards that goal. I imagine that would be a
>>>> huge
>>>>> change across the board, including sources, operators and
>> optimizations,
>>>> to
>>>>> name some. Likely we will need several separate in-depth discussions.
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Jiangjie (Becket) Qin
>>>>> 
>>>>> On Fri, Nov 23, 2018 at 5:14 AM Xingcan Cui <xingc...@gmail.com>
>> wrote:
>>>>> 
>>>>>> Hi all,
>>>>>> 
>>>>>> @Shaoxuan, I think the lifecycle or access domain are both orthogonal
>>>> to
>>>>>> the cache problem. Essentially, this may be the first time we plan to
>>>>>> introduce another storage mechanism other than the state. Maybe it’s
>>>>> better
>>>>>> to first draw a big picture and then concentrate on a specific part?
>>>>>> 
>>>>>> @Becket, yes, actually I am more concerned with the underlying
>> service.
>>>>>> This seems to be quite a major change to the existing codebase. As you
>>>>>> claimed, the service should be extendible to support other components
>>>> and
>>>>>> we’d better discussed it in another thread.
>>>>>> 
>>>>>> All in all, I also eager to enjoy the more interactive Table API, in
>>>> case
>>>>>> of a general and flexible enough service mechanism.
>>>>>> 
>>>>>> Best,
>>>>>> Xingcan
>>>>>> 
>>>>>>> On Nov 22, 2018, at 10:16 AM, Xiaowei Jiang <xiaow...@gmail.com>
>>>>> wrote:
>>>>>>> 
>>>>>>> Relying on a callback for the temp table for clean up is not very
>>>>>> reliable.
>>>>>>> There is no guarantee that it will be executed successfully. We may
>>>>> risk
>>>>>>> leaks when that happens. I think that it's safer to have an
>>>> association
>>>>>>> between temp table and session id. So we can always clean up temp
>>>>> tables
>>>>>>> which are no longer associated with any active sessions.
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Xiaowei
>>>>>>> 
>>>>>>> On Thu, Nov 22, 2018 at 12:55 PM jincheng sun <
>>>>> sunjincheng...@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi Jiangjie&Shaoxuan,
>>>>>>>> 
>>>>>>>> Thanks for initiating this great proposal！
>>>>>>>> 
>>>>>>>> Interactive Programming is very useful and user friendly in case of
>>>>> your
>>>>>>>> examples.
>>>>>>>> Moreover， especially when a business has to be executed in several
>>>>>> stages
>>>>>>>> with dependencies，such as the pipeline of Flink ML, in order to
>>>>> utilize
>>>>>> the
>>>>>>>> intermediate calculation results we have to submit a job by
>>>>>> env.execute().
>>>>>>>> 
>>>>>>>> About the `cache()`  , I think is better to named `persist()`, And
>>>> The
>>>>>>>> Flink framework determines whether we internally cache in memory or
>>>>>> persist
>>>>>>>> to the storage system，Maybe save the data into state backend
>>>>>>>> (MemoryStateBackend or RocksDBStateBackend etc.)
>>>>>>>> 
>>>>>>>> BTW, from the points of my view in the future, support for streaming
>>>>> and
>>>>>>>> batch mode switching in the same job will also benefit in
>>>> "Interactive
>>>>>>>> Programming",  I am looking forward to your JIRAs and FLIP!
>>>>>>>> 
>>>>>>>> Best,
>>>>>>>> Jincheng
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Becket Qin <becket....@gmail.com> 于2018年11月20日周二 下午9:56写道：
>>>>>>>> 
>>>>>>>>> Hi all,
>>>>>>>>> 
>>>>>>>>> As a few recent email threads have pointed out, it is a promising
>>>>>>>>> opportunity to enhance Flink Table API in various aspects,
>>>> including
>>>>>>>>> functionality and ease of use among others. One of the scenarios
>>>>> where
>>>>>> we
>>>>>>>>> feel Flink could improve is interactive programming. To explain the
>>>>>>>> issues
>>>>>>>>> and facilitate the discussion on the solution, we put together the
>>>>>>>>> following document with our proposal.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>> https://docs.google.com/document/d/1d4T2zTyfe7hdncEUAxrlNOYr4e5IMNEZLyqSuuswkA0/edit?usp=sharing
>>>>>>>>> 
>>>>>>>>> Feedback and comments are very welcome!
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> 
>>>>>>>>> Jiangjie (Becket) Qin
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>> 
>>

Re: [DISCUSS] Support Interactive Programming in Flink Table API

Reply via email to