Hi Becket,

Oops, sorry, I didn’t notice that you intend to reuse the existing `TableFactory`. I don’t know why, but I assumed that you wanted to provide an alternative way of writing the data.

Now that I hopefully understand the proposal, maybe we could rename `cache()` to

void materialize()

or, going a step further,

MaterializedTable materialize()
MaterializedTable createMaterializedView()

?

The second option, which returns a handle, is I think more flexible and could provide features such as “refresh”/“delete” or, generally speaking, manage the view. In the future we could also think about adding hooks to automatically refresh the view, etc. It is also more explicit: materialization returning a new table handle will not have the same implicit side effects as adding a simple line of code like `b.cache()` would have.

It would also be more SQL-like, making it more intuitive for users already familiar with SQL.

Piotrek

> On 23 Nov 2018, at 14:53, Becket Qin <becket....@gmail.com> wrote:
>
> Hi Piotrek,
>
> For the cache() method itself, yes, it is equivalent to creating a BUILT-IN
> materialized view with a lifecycle. That functionality is missing today,
> though. Not sure if I understand your question. Do you mean we already have
> the functionality and just need syntactic sugar?
>
> What's more interesting in the proposal is: do we want to stop at creating
> the materialized view? Or do we want to extend that in the future to a more
> useful unified data store distributed with Flink? And do we want to have a
> mechanism to allow more flexible user job patterns with their own
> user-defined services? These considerations are much more architectural.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Fri, Nov 23, 2018 at 6:01 PM Piotr Nowojski <pi...@data-artisans.com>
> wrote:
>
>> Hi,
>>
>> Interesting idea. I’m trying to understand the problem. Isn’t the
>> `cache()` call an equivalent of writing data to a sink and later reading
>> from it? Where this sink has a limited scope/lifetime? And the sink
>> could be implemented as an in-memory or a file sink?
>>
>> If so, what’s the problem with creating a materialised view from a table
>> “b” (from your document’s example) and reusing this materialised view
>> later? Maybe we are lacking mechanisms to clean up materialised views (for
>> example when the current session finishes)? Maybe we need some syntactic
>> sugar on top of it?
>>
>> Piotrek
>>
>>> On 23 Nov 2018, at 07:21, Becket Qin <becket....@gmail.com> wrote:
>>>
>>> Thanks for the suggestion, Jincheng.
>>>
>>> Yes, I think it makes sense to have a persist() with lifecycle/defined
>>> scope. I just added a section in the future work for this.
>>>
>>> Thanks,
>>>
>>> Jiangjie (Becket) Qin
>>>
>>> On Fri, Nov 23, 2018 at 1:55 PM jincheng sun <sunjincheng...@gmail.com>
>>> wrote:
>>>
>>>> Hi Jiangjie,
>>>>
>>>> Thank you for the explanation about the name of `cache()`, I understand
>>>> why you designed it this way!
>>>>
>>>> Another idea is whether we can specify a lifecycle for data persistence?
>>>> For example, persist(LifeCycle.SESSION), so that the user is not worried
>>>> about data loss, and can clearly specify the time range for keeping the
>>>> data. At the same time, if we want to expand, we can also share within a
>>>> certain group of sessions, for example: LifeCycle.SESSION_GROUP(...). I
>>>> am not sure, just an immature suggestion, for reference only!
>>>>
>>>> Best,
>>>> Jincheng
>>>>
>>>> Becket Qin <becket....@gmail.com> wrote on Fri, Nov 23, 2018 at 1:33 PM:
>>>>
>>>>> Re: Jincheng,
>>>>>
>>>>> Thanks for the feedback. Regarding cache() vs. persist(), personally I
>>>>> find cache() to be more accurately describing the behavior, i.e. the
>>>>> Table is cached for the session, but will be deleted after the session
>>>>> is closed. persist() seems a little misleading as people might think
>>>>> the table will still be there even after the session is gone.
>>>>>
>>>>> Great point about mixing the batch and stream processing in the same
>>>>> job. We should absolutely move towards that goal.
>>>>> I imagine that would be a huge change across the board, including
>>>>> sources, operators and optimizations, to name a few. Likely we will
>>>>> need several separate in-depth discussions.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Jiangjie (Becket) Qin
>>>>>
>>>>> On Fri, Nov 23, 2018 at 5:14 AM Xingcan Cui <xingc...@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> @Shaoxuan, I think the lifecycle or access domain are both orthogonal
>>>>>> to the cache problem. Essentially, this may be the first time we plan
>>>>>> to introduce another storage mechanism other than the state. Maybe
>>>>>> it’s better to first draw a big picture and then concentrate on a
>>>>>> specific part?
>>>>>>
>>>>>> @Becket, yes, actually I am more concerned with the underlying
>>>>>> service. This seems to be quite a major change to the existing
>>>>>> codebase. As you claimed, the service should be extensible to support
>>>>>> other components, and we’d better discuss it in another thread.
>>>>>>
>>>>>> All in all, I am also eager to enjoy the more interactive Table API,
>>>>>> given a general and flexible enough service mechanism.
>>>>>>
>>>>>> Best,
>>>>>> Xingcan
>>>>>>
>>>>>>> On Nov 22, 2018, at 10:16 AM, Xiaowei Jiang <xiaow...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Relying on a callback for the temp table for cleanup is not very
>>>>>>> reliable. There is no guarantee that it will be executed
>>>>>>> successfully. We may risk leaks when that happens. I think that it's
>>>>>>> safer to have an association between temp table and session ID. So we
>>>>>>> can always clean up temp tables which are no longer associated with
>>>>>>> any active sessions.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Xiaowei
>>>>>>>
>>>>>>> On Thu, Nov 22, 2018 at 12:55 PM jincheng sun
>>>>>>> <sunjincheng...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Jiangjie & Shaoxuan,
>>>>>>>>
>>>>>>>> Thanks for initiating this great proposal!
>>>>>>>>
>>>>>>>> Interactive Programming is very useful and user friendly in the case
>>>>>>>> of your examples.
>>>>>>>> Moreover, especially when a business has to be executed in several
>>>>>>>> stages with dependencies, such as the pipeline of Flink ML, in order
>>>>>>>> to utilize the intermediate calculation results we have to submit a
>>>>>>>> job by env.execute().
>>>>>>>>
>>>>>>>> About `cache()`, I think it is better to name it `persist()`, and
>>>>>>>> the Flink framework determines whether we internally cache in memory
>>>>>>>> or persist to the storage system. Maybe save the data into a state
>>>>>>>> backend (MemoryStateBackend or RocksDBStateBackend, etc.).
>>>>>>>>
>>>>>>>> BTW, from my point of view, in the future, support for streaming and
>>>>>>>> batch mode switching in the same job will also benefit "Interactive
>>>>>>>> Programming". I am looking forward to your JIRAs and FLIP!
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Jincheng
>>>>>>>>
>>>>>>>>
>>>>>>>> Becket Qin <becket....@gmail.com> wrote on Tue, Nov 20, 2018 at 9:56 PM:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> As a few recent email threads have pointed out, it is a promising
>>>>>>>>> opportunity to enhance Flink Table API in various aspects,
>>>>>>>>> including functionality and ease of use among others. One of the
>>>>>>>>> scenarios where we feel Flink could improve is interactive
>>>>>>>>> programming. To explain the issues and facilitate the discussion on
>>>>>>>>> the solution, we put together the following document with our
>>>>>>>>> proposal.
>>>>>>>>>
>>>>>>>>> https://docs.google.com/document/d/1d4T2zTyfe7hdncEUAxrlNOYr4e5IMNEZLyqSuuswkA0/edit?usp=sharing
>>>>>>>>>
>>>>>>>>> Feedback and comments are very welcome!
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Jiangjie (Becket) Qin