I do actually think that all levels of abstraction have their value. If you wish, we have (top to bottom):
(1) SQL
(2) Table API with simplified Stream/Table duality
(3) DataStream API / windows
(4) DataStream API with custom windows and triggers
(5) Custom Operators

The Data Scientist may not like (5), but there is certainly a group of people who just want the basic fabric (streams and fault-tolerant state) to stitch together whatever they want. I think it would be great to present this layering and simply say "pick your entry point!"

The abstraction of (5) is actually similar to the operator-centric (non-fluent) API of Kafka Streams. It is only slightly more involved in Flink, because it exposes too many internals. With some simple wrapper/template, this could become a full-fledged API function, or a separate API in itself.

Stephan

On Wed, Aug 17, 2016 at 6:16 PM, Martin Neumann <mneum...@sics.se> wrote:

> I agree with Vasia that for data scientists it's likely easier to learn the high-level API. I like the material from http://dataartisans.github.io/flink-training/ but all of it focuses on the high-level API.
>
> Maybe we could have a guide (blog post, lecture, whatever) on how to get into Flink as a Storm person. That would allow people with that background to dive directly into the lower-level API, working with models similar to what they were used to. I would volunteer, but I'm not familiar with Storm.
>
> For my part, I have always had to use some lower-level API in all of my applications, most of the time pestering Aljoscha about the details. So either I'm the exception or there is a need for more complex examples showcasing the lower-level API methods. One of the things I have been using in several pipelines so far is extracting the start and end timestamps from a window and adding them to the window aggregate. Maybe something like this would be a useful example to include in the training.
>
> Side question:
> I assume there are recurring design patterns in the stream applications users develop.
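[The window start/end extraction Martin mentions above is easy to sketch outside of Flink. The following is a rough, Flink-free Python illustration assuming tumbling windows; all names and the one-minute window size are hypothetical choices for the sketch.]

```python
WINDOW_SIZE_MS = 60_000  # one-minute tumbling windows (an assumption for this sketch)

def window_bounds(timestamp_ms, size_ms=WINDOW_SIZE_MS):
    """Return (start, end) of the tumbling window containing timestamp_ms."""
    start = timestamp_ms - (timestamp_ms % size_ms)
    return start, start + size_ms

def aggregate_with_bounds(elements):
    """Group (timestamp_ms, value) pairs per window and sum the values,
    tagging each aggregate with its window's start and end timestamps."""
    windows = {}
    for ts, value in elements:
        bounds = window_bounds(ts)
        windows[bounds] = windows.get(bounds, 0) + value
    return [
        {"window_start": start, "window_end": end, "sum": total}
        for (start, end), total in sorted(windows.items())
    ]
```

[In Flink itself the same effect is reached inside a window function, where the `TimeWindow` argument exposes the bounds via `getStart()`/`getEnd()`, so the aggregate can be tagged without recomputing them.]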
> Is there any chance we will be able to identify or create some design patterns (similar to the Java design patterns)? That would make it easier to use the lower-level API and might help people avoid pitfalls like the one Aljoscha mentioned.
>
> Cheers,
> Martin
>
> PS: I hope it's fine for me to butt into the discussion like this.
>
> On Tue, Aug 16, 2016 at 4:34 PM, Wright, Eron <ewri...@live.com> wrote:
>
> > Jamie,
> > I think you raise a valid concern, but I would hesitate to accept the suggestion that the low-level API be promoted to app developers.
> >
> > Higher-level abstractions tend to be more constrained and more optimized, whereas lower-level abstractions tend to be more powerful, more laborious to use, and provide the system with less knowledge. It is a classic tradeoff.
> >
> > I think it is important to consider what the important/distinguishing characteristics of the Flink framework are: exactly-once guarantees, event-time support, support for job upgrade without data loss, fault tolerance, etc. I'm speculating that the high-level abstraction provided to app developers is probably needed to retain those characteristics.
> >
> > I think Vasia makes a good point that SQL might be a good alternative way to ease into Flink.
> >
> > Finally, I believe the low-level API is primarily intended for extension purposes (connectors, operations, etc.), not app development. It could use better documentation to ensure that third-party extensions support those key characteristics.
> >
> > -Eron
> >
> > > On Aug 16, 2016, at 3:12 AM, Vasiliki Kalavri <vasilikikala...@gmail.com> wrote:
> > >
> > > Hi Jamie,
> > >
> > > thanks for sharing your thoughts on this! You're raising some interesting points.
> > >
> > > Whether users find the lower-level primitives more intuitive depends on their background, I believe.
> > > From what I've seen, if users are coming from the S4/Storm world and are used to the "compositional" way of streaming, then indeed it's easier for them to think and operate on that level. These are usually people who have seen/built streaming things before trying out Flink. But if we're talking about analysts and people coming from the "batch" way of thinking, or people used to working with SQL/Python, then the higher-level declarative API is probably easier to understand.
> > >
> > > I do think that we should make the lower-level API more visible and document it properly, but I'm not sure we should teach Flink at this level first. I think that presenting it as a set of "advanced" features actually makes more sense.
> > >
> > > Cheers,
> > > -Vasia.
> > >
> > > On 16 August 2016 at 04:24, Jamie Grier <ja...@data-artisans.com> wrote:
> > >
> > >> You lost me at lattice, Aljoscha ;)
> > >>
> > >> I do think something like the more powerful N-way FlatMap w/ timers Aljoscha is describing here would probably solve most of the problem. Often Flink's higher-level primitives work well for people, and that's great. It's just that I also spend a fair amount of time discussing with people how to map what they know they want to do onto operations that aren't a perfect fit, and it sometimes liberates them when they realize they can just implement it the way they want by dropping down a level. They usually don't go there themselves, though.
> > >>
> > >> I mention teaching this "first" and then the higher layers, I guess, because that's just a matter of teaching philosophy. I think it's good to see the basic operations that are available first and then understand that the other abstractions are built on top of that.
> > >> That way you're not afraid to drop down to basics when you know what you want to get done.
> > >>
> > >> -Jamie
> > >>
> > >> On Mon, Aug 15, 2016 at 2:11 AM, Aljoscha Krettek <aljos...@apache.org> wrote:
> > >>
> > >>> Hi All,
> > >>> I also thought about this recently. A good thing to add would be a user-facing operator that behaves more or less like an enhanced FlatMap with multiple inputs, multiple outputs, state access, and keyed timers. I'm a bit hesitant, though, since users rarely think about the implications that come with state updating and out-of-order events. If you don't implement a stateful operator correctly, you get pretty much arbitrary results.
> > >>>
> > >>> The problem with out-of-order event arrival and state updates is that the state basically has to transition monotonically "upwards" through a lattice for the computation to make sense. I know this sounds rather theoretical, so I'll try to explain with an example. Say you have an operator that waits for timestamped elements A, B, C to arrive in timestamp order and then does some processing. The naive approach would be a small state machine that tracks which element you have seen so far. The state machine has four states: NOTHING, SEEN_A, SEEN_B, and SEEN_C, and is supposed to traverse them linearly as the elements arrive. This doesn't work, however, when elements arrive in an order that does not match their timestamp order. What the user should do instead is keep a "Set" state that tracks the elements seen so far. Once it has seen {A, B, C}, the operator must check the timestamps and then do the processing, if required.
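[The pitfall just described can be simulated in a few lines of plain Python, with no Flink involved; all names here are illustrative. The naive state machine only advances on the in-order prefix NOTHING -> SEEN_A -> SEEN_B -> SEEN_C, so an out-of-order arrival silently stalls it, while the set-based state reaches {A, B, C} regardless of arrival order.]

```python
EXPECTED_ORDER = ["A", "B", "C"]

def naive_state_machine(arrivals):
    """Advance only when the next expected element arrives; anything
    out of order is (wrongly) ignored, so the machine can get stuck."""
    state = "NOTHING"
    seen = 0  # how far through EXPECTED_ORDER we have advanced
    for element in arrivals:
        if seen < len(EXPECTED_ORDER) and element == EXPECTED_ORDER[seen]:
            seen += 1
            state = "SEEN_" + element
    return state

def set_state(arrivals):
    """Track the set of seen elements; completion is order-independent
    because the sets only grow 'upwards' through the subset lattice."""
    seen = set()
    for element in arrivals:
        seen.add(element)
        if seen == {"A", "B", "C"}:
            return "COMPLETE"  # now check timestamps and do the processing
    return "INCOMPLETE"
```

[Running both on the arrival order B, A, C: the naive machine ends in SEEN_A and never completes, while the set-based version reaches COMPLETE.]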
> > >>> The set of possible combinations of A, B, and C forms a lattice when combined with the "subset" operation, and traversal through these sets is monotonically "upwards", so it works regardless of the order in which the elements arrive. (I recently pointed this out on the Beam mailing list, and Kenneth Knowles rightly pointed out that what I was describing was in fact a lattice.)
> > >>>
> > >>> I know this is a bit off-topic, but I think it's very easy for users to write wrong operations when they are dealing with state. We should still have a good API for it, though. Just wanted to make people aware of this.
> > >>>
> > >>> Cheers,
> > >>> Aljoscha
> > >>>
> > >>> On Mon, 15 Aug 2016 at 08:18 Matthias J. Sax <mj...@apache.org> wrote:
> > >>>
> > >>>> It really depends on the skill level of the developer. Using the low-level API requires thinking about many details (e.g. state handling) that could be done wrong.
> > >>>>
> > >>>> As Flink gets a broader community, more people will use it who might not have the required skill level to deal with the low-level API. For more trained users, it is of course a powerful tool!
> > >>>>
> > >>>> I guess it boils down to the question of what type of developer Flink targets, and whether the low-level API should be advertised aggressively or not. Also keep in mind that many people criticized Storm's low-level API as hard to program, etc.
> > >>>>
> > >>>> -Matthias
> > >>>>
> > >>>> On 08/15/2016 07:46 AM, Gyula Fóra wrote:
> > >>>>> Hi Jamie,
> > >>>>>
> > >>>>> I agree that it is often much easier to work on the lower-level APIs if you know what you are doing.
> > >>>>> I think it would be nice to have very clean abstractions at that level so we could teach this to users first, but currently I think it's not easy enough to be a good starting point.
> > >>>>>
> > >>>>> The user needs to understand a lot about the system if they don't want to hurt other parts of the pipeline: for instance, working with the StreamRecords, propagating watermarks, and working with state internals.
> > >>>>>
> > >>>>> This all might be overwhelming at first glance. But maybe we can slim some abstractions down to the point where this becomes a kind of extension of the RichFunctions.
> > >>>>>
> > >>>>> Cheers,
> > >>>>> Gyula
> > >>>>>
> > >>>>> On Sat, Aug 13, 2016, 17:48 Jamie Grier <ja...@data-artisans.com> wrote:
> > >>>>>
> > >>>>>> Hey all,
> > >>>>>>
> > >>>>>> I've noticed a few times now, when trying to help users implement particular things in the Flink API, that it can be complicated to map what they know they are trying to do onto higher-level Flink concepts such as windowing or Connect/CoFlatMap/ValueState, etc.
> > >>>>>>
> > >>>>>> At some point it just becomes easier to think about writing a Flink operator yourself that is integrated into the pipeline with a transform() call.
> > >>>>>>
> > >>>>>> It can just be easier to think at a more basic level. For example, I can write an operator that consumes one or two input streams (should probably be N), updates state which is managed for me fault-tolerantly, and outputs elements or sets up timers/triggers that give me callbacks from which I can also update state or emit elements.
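[The operator shape Jamie describes — process elements, keep state, register timer callbacks, emit downstream — can be mocked up in a few lines. This is a toy Python sketch, not Flink's actual interface (in Flink one would implement a stream operator and attach it with transform()); every class and method name here is a hypothetical illustration, and fault tolerance and watermarks are ignored.]

```python
class CountThenFlushOperator:
    """Toy operator: counts elements per key and emits (key, count)
    when a timer callback fires for that key."""

    def __init__(self):
        self.counts = {}   # stand-in for fault-tolerant keyed state
        self.timers = []   # (time, key) callbacks we have requested
        self.output = []   # elements emitted downstream

    def process_element(self, key, value, current_time):
        """Called for each incoming element (value is unused: we only count)."""
        if key not in self.counts:
            # first element for this key: request a callback 10 time
            # units later (an arbitrary choice for the sketch)
            self.timers.append((current_time + 10, key))
        self.counts[key] = self.counts.get(key, 0) + 1

    def on_timer(self, time, key):
        """Timer callback: emit the aggregate and clear the state."""
        self.output.append((key, self.counts.pop(key, 0)))
```

[Driving it by hand: two elements for "user-1" arrive, then the timer fires and the operator emits ("user-1", 2) and drops the state. The point is not this particular logic but that, at this level, the state layout, the timing, and the output are entirely up to the author.]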
> > >>>>>> When you think at this level you realize you can program just about anything you want. You can create whatever fault-tolerant data structures you want and easily execute robust stateful computation over data streams at scale. This is the real technology and power of Flink, IMO.
> > >>>>>>
> > >>>>>> Also, at this level I don't have to think about the complexities of windowing semantics, learn as much API, etc. I can easily have some inputs that are broadcast and others that are keyed, manage my own state in whatever data structure makes sense, etc. If I know exactly what I actually want to do, I can just do it with the full power of my chosen language, data structures, etc. I'm not "restricted" to trying to map everything onto higher-level Flink constructs, which is sometimes actually more complicated.
> > >>>>>>
> > >>>>>> Programming at this level is actually fairly easy to do, but people seem a bit afraid of this level of the API. They think of it as low-level or custom hacking.
> > >>>>>>
> > >>>>>> Anyway, I guess my thought is this: should we explain Flink to people at this level *first*? Show that you have nearly unlimited power and flexibility to build what you want, *and only then* explain the higher-level APIs they can use *if* those match their use cases well.
> > >>>>>>
> > >>>>>> Would this better demonstrate to people the power of Flink and maybe *liberate* them a bit from feeling they have to map their problem onto a more complex set of higher-level primitives?
> > >>>>>> I see people trying to shoe-horn what they are really trying to do, which is simple to explain in English, onto windows, triggers, CoFlatMaps, etc., and this gets complicated sometimes. It's like an impedance mismatch. You could often just solve the problem very easily in straight Java/Scala.
> > >>>>>>
> > >>>>>> Anyway, it's very easy to drop down a level in the API and program whatever you want, but users don't seem to *perceive* it that way.
> > >>>>>>
> > >>>>>> Just some thoughts... Any feedback? Have any of you had similar experiences when working with newer Flink users, or as a newer Flink user yourself? Can/should we do anything to make the *lower-level* API more accessible/visible to users?
> > >>>>>>
> > >>>>>> -Jamie
> > >>
> > >> --
> > >>
> > >> Jamie Grier
> > >> data Artisans, Director of Applications Engineering
> > >> @jamiegrier <https://twitter.com/jamiegrier>
> > >> ja...@data-artisans.com