Thanks for the explanation, Piotrek. Is there anything extra a user can do on a MaterializedTable that they cannot do on a Table? After users call table.cache(), they can just use that table and do anything that is supported on a Table, including SQL.
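A minimal Java sketch of the two API shapes under discussion may help make the question concrete. Everything in it is an assumption for illustration: the Table and MaterializedTable classes are toy stand-ins rather than Flink classes, the plan is just a Supplier, and refresh()/delete() are only the handle operations suggested in the quoted message below.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Toy stand-ins for illustration only; none of these are Flink classes.
final class CacheVsMaterialize {

    /** A lazily evaluated "table": every collect() re-runs the plan unless cached. */
    static class Table {
        protected final Supplier<List<Integer>> plan;
        private List<Integer> cached; // set by cache(), null otherwise

        Table(Supplier<List<Integer>> plan) { this.plan = plan; }

        /** Option 1: cache() changes how this same table is evaluated afterwards. */
        void cache() { cached = new ArrayList<>(plan.get()); }

        List<Integer> collect() { return cached != null ? cached : plan.get(); }

        /** Option 2: materialize() returns a separate handle and leaves this table as-is. */
        MaterializedTable materialize() { return new MaterializedTable(plan); }
    }

    /** Usable anywhere a Table is, plus management operations on the stored copy. */
    static final class MaterializedTable extends Table {
        private List<Integer> rows;

        MaterializedTable(Supplier<List<Integer>> plan) {
            super(plan);
            refresh();
        }

        /** Re-runs the plan and replaces the stored rows. */
        void refresh() { rows = new ArrayList<>(plan.get()); }

        /** Drops the stored rows; later reads fall back to the plan. */
        void delete() { rows = null; }

        @Override
        List<Integer> collect() { return rows != null ? rows : plan.get(); }
    }

    public static void main(String[] args) {
        int[] runs = {0}; // counts how often the plan is executed
        Table b = new Table(() -> { runs[0]++; return List.of(1, 2, 3); });

        b.cache();   // the plan runs here
        b.collect(); // served from the cached rows
        b.collect();
        System.out.println("plan runs after cache(): " + runs[0]);

        MaterializedTable m = b.materialize(); // the plan runs again for the new handle
        m.collect();
        m.refresh(); // an operation the plain Table does not offer
        System.out.println("plan runs after materialize() + refresh(): " + runs[0]);
    }
}
```

In this sketch the only things a MaterializedTable adds over a Table are the management calls; reading from either looks the same to the caller, which is the point of the question above.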
Naming-wise, either cache() or materialize() sounds fine to me. cache() is a bit more general than materialize(). Given that we are enhancing the Table API to also support non-relational processing cases, cache() might be slightly better.

Thanks,

Jiangjie (Becket) Qin

On Fri, Nov 23, 2018 at 11:25 PM Piotr Nowojski <pi...@data-artisans.com> wrote:

> Hi Becket,
>
> Oops, sorry, I didn’t notice that you intend to reuse the existing
> `TableFactory`. I don’t know why, but I assumed that you wanted to provide
> an alternate way of writing the data.
>
> Now that I hopefully understand the proposal, maybe we could rename
> `cache()` to
>
> void materialize()
>
> or, going a step further,
>
> MaterializedTable materialize()
> MaterializedTable createMaterializedView()
>
> ?
>
> The second option, returning a handle, is I think more flexible and could
> provide features such as “refresh”/“delete” or, generally speaking, manage
> the view. In the future we could also think about adding hooks to
> automatically refresh the view etc. It is also more explicit:
> materialization returning a new table handle will not have the same
> implicit side effects as adding a simple line of code like `b.cache()`
> would have.
>
> It would also be more SQL-like, making it more intuitive for users already
> familiar with SQL.
>
> Piotrek
>
> > On 23 Nov 2018, at 14:53, Becket Qin <becket....@gmail.com> wrote:
> >
> > Hi Piotrek,
> >
> > For the cache() method itself, yes, it is equivalent to creating a
> > BUILT-IN materialized view with a lifecycle. That functionality is
> > missing today, though. Not sure if I understand your question. Do you
> > mean we already have the functionality and just need syntactic sugar?
> >
> > What's more interesting in the proposal is: do we want to stop at
> > creating the materialized view? Or do we want to extend that in the
> > future to a more useful unified data store distributed with Flink?
> > And do we want to have a mechanism that allows more flexible user job
> > patterns with their own user-defined services? These considerations are
> > much more architectural.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Fri, Nov 23, 2018 at 6:01 PM Piotr Nowojski <pi...@data-artisans.com>
> > wrote:
> >
> >> Hi,
> >>
> >> Interesting idea. I’m trying to understand the problem. Isn’t the
> >> `cache()` call an equivalent of writing data to a sink and later reading
> >> from it, where this sink has a limited scope/lifetime? And the sink
> >> could be implemented as an in-memory or a file sink?
> >>
> >> If so, what’s the problem with creating a materialised view from a table
> >> “b” (from your document’s example) and reusing this materialised view
> >> later? Maybe we are lacking mechanisms to clean up materialised views
> >> (for example when the current session finishes)? Maybe we need some
> >> syntactic sugar on top of it?
> >>
> >> Piotrek
> >>
> >>> On 23 Nov 2018, at 07:21, Becket Qin <becket....@gmail.com> wrote:
> >>>
> >>> Thanks for the suggestion, Jincheng.
> >>>
> >>> Yes, I think it makes sense to have a persist() with lifecycle/defined
> >>> scope. I just added a section in the future work for this.
> >>>
> >>> Thanks,
> >>>
> >>> Jiangjie (Becket) Qin
> >>>
> >>> On Fri, Nov 23, 2018 at 1:55 PM jincheng sun <sunjincheng...@gmail.com>
> >>> wrote:
> >>>
> >>>> Hi Jiangjie,
> >>>>
> >>>> Thank you for the explanation about the name of `cache()`. I understand
> >>>> why you designed it this way!
> >>>>
> >>>> Another idea is whether we can specify a lifecycle for data persistence.
> >>>> For example, persist(LifeCycle.SESSION), so that the user is not
> >>>> worried about data loss, and will clearly specify the time range for
> >>>> keeping the data.
> >>>> At the same time, if we want to expand, we can also share within a
> >>>> certain group of sessions, for example LifeCycle.SESSION_GROUP(...). I
> >>>> am not sure; this is just an immature suggestion, for reference only!
> >>>>
> >>>> Bests,
> >>>> Jincheng
> >>>>
> >>>> Becket Qin <becket....@gmail.com> wrote on Fri, Nov 23, 2018 at 1:33 PM:
> >>>>
> >>>>> Re: Jincheng,
> >>>>>
> >>>>> Thanks for the feedback. Regarding cache() vs. persist(), personally I
> >>>>> find cache() to more accurately describe the behavior, i.e. the Table
> >>>>> is cached for the session, but will be deleted after the session is
> >>>>> closed. persist() seems a little misleading, as people might think the
> >>>>> table will still be there even after the session is gone.
> >>>>>
> >>>>> Great point about mixing batch and stream processing in the same job.
> >>>>> We should absolutely move towards that goal. I imagine that would be a
> >>>>> huge change across the board, including sources, operators and
> >>>>> optimizations, to name some. Likely we will need several separate
> >>>>> in-depth discussions.
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Jiangjie (Becket) Qin
> >>>>>
> >>>>> On Fri, Nov 23, 2018 at 5:14 AM Xingcan Cui <xingc...@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi all,
> >>>>>>
> >>>>>> @Shaoxuan, I think the lifecycle and the access domain are both
> >>>>>> orthogonal to the cache problem. Essentially, this may be the first
> >>>>>> time we plan to introduce another storage mechanism other than the
> >>>>>> state. Maybe it’s better to first draw a big picture and then
> >>>>>> concentrate on a specific part?
> >>>>>>
> >>>>>> @Becket, yes, actually I am more concerned with the underlying
> >>>>>> service. This seems to be quite a major change to the existing
> >>>>>> codebase. As you claimed, the service should be extensible to support
> >>>>>> other components, and we’d better discuss it in another thread.
> >>>>>>
> >>>>>> All in all, I am also eager to enjoy the more interactive Table API,
> >>>>>> provided the service mechanism is general and flexible enough.
> >>>>>>
> >>>>>> Best,
> >>>>>> Xingcan
> >>>>>>
> >>>>>>> On Nov 22, 2018, at 10:16 AM, Xiaowei Jiang <xiaow...@gmail.com>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>> Relying on a callback for the temp table for clean up is not very
> >>>>>>> reliable. There is no guarantee that it will be executed
> >>>>>>> successfully. We may risk leaks when that happens. I think that it's
> >>>>>>> safer to have an association between temp table and session id. So
> >>>>>>> we can always clean up temp tables which are no longer associated
> >>>>>>> with any active sessions.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Xiaowei
> >>>>>>>
> >>>>>>> On Thu, Nov 22, 2018 at 12:55 PM jincheng sun
> >>>>>>> <sunjincheng...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Hi Jiangjie & Shaoxuan,
> >>>>>>>>
> >>>>>>>> Thanks for initiating this great proposal!
> >>>>>>>>
> >>>>>>>> Interactive Programming is very useful and user friendly in the case
> >>>>>>>> of your examples.
> >>>>>>>> Moreover, especially when a business has to be executed in several
> >>>>>>>> stages with dependencies, such as the pipeline of Flink ML, in order
> >>>>>>>> to utilize the intermediate calculation results we have to submit a
> >>>>>>>> job by env.execute().
> >>>>>>>>
> >>>>>>>> About `cache()`, I think it is better named `persist()`, and the
> >>>>>>>> Flink framework determines whether we internally cache in memory or
> >>>>>>>> persist to the storage system. Maybe save the data into a state
> >>>>>>>> backend (MemoryStateBackend or RocksDBStateBackend etc.)
> >>>>>>>>
> >>>>>>>> BTW, from my point of view, in the future, support for switching
> >>>>>>>> between streaming and batch mode in the same job will also benefit
> >>>>>>>> "Interactive Programming". I am looking forward to your JIRAs and
> >>>>>>>> FLIP!
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Jincheng
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Becket Qin <becket....@gmail.com> wrote on Tue, Nov 20, 2018 at 9:56 PM:
> >>>>>>>>
> >>>>>>>>> Hi all,
> >>>>>>>>>
> >>>>>>>>> As a few recent email threads have pointed out, it is a promising
> >>>>>>>>> opportunity to enhance Flink Table API in various aspects,
> >>>>>>>>> including functionality and ease of use among others. One of the
> >>>>>>>>> scenarios where we feel Flink could improve is interactive
> >>>>>>>>> programming. To explain the issues and facilitate the discussion
> >>>>>>>>> on the solution, we put together the following document with our
> >>>>>>>>> proposal.
> >>>>>>>>>
> >>>>>>>>> https://docs.google.com/document/d/1d4T2zTyfe7hdncEUAxrlNOYr4e5IMNEZLyqSuuswkA0/edit?usp=sharing
> >>>>>>>>>
> >>>>>>>>> Feedback and comments are very welcome!
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>>
> >>>>>>>>> Jiangjie (Becket) Qin