I do actually think that all levels of abstraction have their value. If you wish, we have (top to bottom):
(1) SQL
(2) Table API with simplified Stream/Table duality
(3) DataStream API / windows
(4) DataStream API with custom windows and triggers
(5) Custom Operators

The Data Scientist may not like (5), but there is certainly a group of people who just want the basic fabric (streams and fault-tolerant state) to stitch together whatever they want. I think it would be great to present this layering and simply say "pick your entry point!"

The abstraction of (5) is actually similar to the operator-centric (non-fluent) API of Kafka Streams. It is only slightly more involved in Flink, because it exposes too many internals. With some simple wrapper/template, this could become a full-fledged API function, or a separate API in itself.

Stephan

On Wed, Aug 17, 2016 at 6:16 PM, Martin Neumann <mneum...@sics.se> wrote:

> I agree with Vasia that for data scientists it's likely easier to learn the high-level API. I like the material from http://dataartisans.github.io/flink-training/ but all of it focuses on the high-level API.
>
> Maybe we could have a guide (blog post, lecture, whatever) on how to get into Flink as a Storm person. That would allow people with that background to dive directly into the lower-level API, working with models similar to what they were used to. I would volunteer, but I'm not familiar with Storm.
>
> For my part, I have always had to use some lower-level API in all of my applications, most of the time pestering Aljoscha about the details. So either I'm the exception or there is a need for more complex examples showcasing the lower-level API methods. One of the things I have been using in several pipelines so far is extracting the start and end timestamps from a window and adding them to the window aggregate. Maybe something like this would be a useful example to include in the training.
>
> Side question:
> I assume there are recurring design patterns in the stream applications users develop.
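[The window start/end extraction Martin mentions above is easy to sketch outside of Flink. The following is a rough, Flink-free Python illustration assuming tumbling windows; all names and the one-minute window size are hypothetical choices for the sketch.]

```python
WINDOW_SIZE_MS = 60_000  # one-minute tumbling windows (an assumption for this sketch)

def window_bounds(timestamp_ms, size_ms=WINDOW_SIZE_MS):
    """Return (start, end) of the tumbling window containing timestamp_ms."""
    start = timestamp_ms - (timestamp_ms % size_ms)
    return start, start + size_ms

def aggregate_with_bounds(elements):
    """Group (timestamp_ms, value) pairs per window and sum the values,
    tagging each aggregate with its window's start and end timestamps."""
    windows = {}
    for ts, value in elements:
        bounds = window_bounds(ts)
        windows[bounds] = windows.get(bounds, 0) + value
    return [
        {"window_start": start, "window_end": end, "sum": total}
        for (start, end), total in sorted(windows.items())
    ]
```

[In Flink itself the same effect is reached inside a window function, where the `TimeWindow` argument exposes the bounds via `getStart()`/`getEnd()`, so the aggregate can be tagged without recomputing them.]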
> Is there any chance we will be able to identify or create some design patterns (similar to the Java design patterns)? That would make it easier to use the lower-level API and might help people avoid pitfalls like the one Aljoscha mentioned.
>
> Cheers,
> Martin
>
> PS: I hope it's fine for me to butt into the discussion like this.
>
> On Tue, Aug 16, 2016 at 4:34 PM, Wright, Eron <ewri...@live.com> wrote:
>
> > Jamie,
> > I think you raise a valid concern, but I would hesitate to accept the suggestion that the low-level API be promoted to app developers.
> >
> > Higher-level abstractions tend to be more constrained and more optimized, whereas lower-level abstractions tend to be more powerful, more laborious to use, and provide the system with less knowledge. It is a classic tradeoff.
> >
> > I think it is important to consider what the important/distinguishing characteristics of the Flink framework are: exactly-once guarantees, event-time support, support for job upgrade without data loss, fault tolerance, etc. I'm speculating that the high-level abstraction provided to app developers is probably needed to retain those characteristics.
> >
> > I think Vasia makes a good point that SQL might be a good alternative way to ease into Flink.
> >
> > Finally, I believe the low-level API is primarily intended for extension purposes (connectors, operations, etc.), not app development. It could use better documentation to ensure that third-party extensions support those key characteristics.
> >
> > -Eron
> >
> > > On Aug 16, 2016, at 3:12 AM, Vasiliki Kalavri <vasilikikala...@gmail.com> wrote:
> > >
> > > Hi Jamie,
> > >
> > > thanks for sharing your thoughts on this! You're raising some interesting points.
> > >
> > > Whether users find the lower-level primitives more intuitive depends on their background, I believe.
> > > From what I've seen, if users are coming from the S4/Storm world and are used to the "compositional" way of streaming, then indeed it's easier for them to think and operate on that level. These are usually people who have seen/built streaming things before trying out Flink. But if we're talking about analysts and people coming from the "batch" way of thinking, or people used to working with SQL/Python, then the higher-level declarative API is probably easier to understand.
> > >
> > > I do think that we should make the lower-level API more visible and document it properly, but I'm not sure we should teach Flink at this level first. I think that presenting it as a set of "advanced" features actually makes more sense.
> > >
> > > Cheers,
> > > -Vasia.
> > >
> > > On 16 August 2016 at 04:24, Jamie Grier <ja...@data-artisans.com> wrote:
> > >
> > >> You lost me at lattice, Aljoscha ;)
> > >>
> > >> I do think something like the more powerful N-way FlatMap w/ timers Aljoscha is describing here would probably solve most of the problem. Often Flink's higher-level primitives work well for people, and that's great. It's just that I also spend a fair amount of time discussing with people how to map what they know they want to do onto operations that aren't a perfect fit, and it sometimes liberates them when they realize they can just implement it the way they want by dropping down a level. They usually don't go there themselves, though.
> > >>
> > >> I mention teaching this "first" and then the higher layers, I guess, because that's just a matter of teaching philosophy. I think it's good to see the basic operations that are available first and then understand that the other abstractions are built on top of that.
> > >> That way you're not afraid to drop down to basics when you know what you want to get done.
> > >>
> > >> -Jamie
> > >>
> > >> On Mon, Aug 15, 2016 at 2:11 AM, Aljoscha Krettek <aljos...@apache.org> wrote:
> > >>
> > >>> Hi All,
> > >>> I also thought about this recently. A good thing to add would be a user-facing operator that behaves more or less like an enhanced FlatMap with multiple inputs, multiple outputs, state access, and keyed timers. I'm a bit hesitant, though, since users rarely think about the implications that come with state updating and out-of-order events. If you don't implement a stateful operator correctly, you get pretty much arbitrary results.
> > >>>
> > >>> The problem with out-of-order event arrival and state updates is that the state basically has to transition monotonically "upwards" through a lattice for the computation to make sense. I know this sounds rather theoretical, so I'll try to explain with an example. Say you have an operator that waits for timestamped elements A, B, C to arrive in timestamp order and then does some processing. The naive approach would be a small state machine that tracks which element you have seen so far. The state machine has four states: NOTHING, SEEN_A, SEEN_B, and SEEN_C, and is supposed to traverse them linearly as the elements arrive. This doesn't work, however, when elements arrive in an order that does not match their timestamp order. What the user should do instead is keep a "Set" state that tracks the elements seen so far. Once it has seen {A, B, C}, the operator must check the timestamps and then do the processing, if required.
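[The pitfall just described can be simulated in a few lines of plain Python, with no Flink involved; all names here are illustrative. The naive state machine only advances on the in-order prefix NOTHING -> SEEN_A -> SEEN_B -> SEEN_C, so an out-of-order arrival silently stalls it, while the set-based state reaches {A, B, C} regardless of arrival order.]

```python
EXPECTED_ORDER = ["A", "B", "C"]

def naive_state_machine(arrivals):
    """Advance only when the next expected element arrives; anything
    out of order is (wrongly) ignored, so the machine can get stuck."""
    state = "NOTHING"
    seen = 0  # how far through EXPECTED_ORDER we have advanced
    for element in arrivals:
        if seen < len(EXPECTED_ORDER) and element == EXPECTED_ORDER[seen]:
            seen += 1
            state = "SEEN_" + element
    return state

def set_state(arrivals):
    """Track the set of seen elements; completion is order-independent
    because the sets only grow 'upwards' through the subset lattice."""
    seen = set()
    for element in arrivals:
        seen.add(element)
        if seen == {"A", "B", "C"}:
            return "COMPLETE"  # now check timestamps and do the processing
    return "INCOMPLETE"
```

[Running both on the arrival order B, A, C: the naive machine ends in SEEN_A and never completes, while the set-based version reaches COMPLETE.]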
> > >>> The set of possible combinations of A, B, and C forms a lattice when combined with the "subset" operation, and traversal through these sets is monotonically "upwards", so it works regardless of the order in which the elements arrive. (I recently pointed this out on the Beam mailing list, and Kenneth Knowles rightly pointed out that what I was describing was in fact a lattice.)
> > >>>
> > >>> I know this is a bit off-topic, but I think it's very easy for users to write wrong operations when they are dealing with state. We should still have a good API for it, though. Just wanted to make people aware of this.
> > >>>
> > >>> Cheers,
> > >>> Aljoscha
> > >>>
> > >>> On Mon, 15 Aug 2016 at 08:18 Matthias J. Sax <mj...@apache.org> wrote:
> > >>>
> > >>>> It really depends on the skill level of the developer. Using the low-level API requires thinking about many details (e.g. state handling) that could be done wrong.
> > >>>>
> > >>>> As Flink gets a broader community, more people will use it who might not have the required skill level to deal with the low-level API. For more trained users, it is of course a powerful tool!
> > >>>>
> > >>>> I guess it boils down to the question of what type of developer Flink targets, and whether the low-level API should be advertised aggressively or not. Also keep in mind that many people criticized Storm's low-level API as hard to program, etc.
> > >>>>
> > >>>> -Matthias
> > >>>>
> > >>>> On 08/15/2016 07:46 AM, Gyula Fóra wrote:
> > >>>>> Hi Jamie,
> > >>>>>
> > >>>>> I agree that it is often much easier to work on the lower-level APIs if you know what you are doing.
> > >>>>> I think it would be nice to have very clean abstractions at that level so we could teach this to users first, but currently I think it's not easy enough to be a good starting point.
> > >>>>>
> > >>>>> The user needs to understand a lot about the system if they don't want to hurt other parts of the pipeline: for instance, working with the StreamRecords, propagating watermarks, and working with state internals.
> > >>>>>
> > >>>>> This all might be overwhelming at first glance. But maybe we can slim some abstractions down to the point where this becomes a kind of extension of the RichFunctions.
> > >>>>>
> > >>>>> Cheers,
> > >>>>> Gyula
> > >>>>>
> > >>>>> On Sat, Aug 13, 2016, 17:48 Jamie Grier <ja...@data-artisans.com> wrote:
> > >>>>>
> > >>>>>> Hey all,
> > >>>>>>
> > >>>>>> I've noticed a few times now, when trying to help users implement particular things in the Flink API, that it can be complicated to map what they know they are trying to do onto higher-level Flink concepts such as windowing or Connect/CoFlatMap/ValueState, etc.
> > >>>>>>
> > >>>>>> At some point it just becomes easier to think about writing a Flink operator yourself that is integrated into the pipeline with a transform() call.
> > >>>>>>
> > >>>>>> It can just be easier to think at a more basic level. For example, I can write an operator that consumes one or two input streams (should probably be N), updates state which is managed for me fault-tolerantly, and outputs elements or sets up timers/triggers that give me callbacks from which I can also update state or emit elements.
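[The operator shape Jamie describes — process elements, keep state, register timer callbacks, emit downstream — can be mocked up in a few lines. This is a toy Python sketch, not Flink's actual interface (in Flink one would implement a stream operator and attach it with transform()); every class and method name here is a hypothetical illustration, and fault tolerance and watermarks are ignored.]

```python
class CountThenFlushOperator:
    """Toy operator: counts elements per key and emits (key, count)
    when a timer callback fires for that key."""

    def __init__(self):
        self.counts = {}   # stand-in for fault-tolerant keyed state
        self.timers = []   # (time, key) callbacks we have requested
        self.output = []   # elements emitted downstream

    def process_element(self, key, value, current_time):
        """Called for each incoming element (value is unused: we only count)."""
        if key not in self.counts:
            # first element for this key: request a callback 10 time
            # units later (an arbitrary choice for the sketch)
            self.timers.append((current_time + 10, key))
        self.counts[key] = self.counts.get(key, 0) + 1

    def on_timer(self, time, key):
        """Timer callback: emit the aggregate and clear the state."""
        self.output.append((key, self.counts.pop(key, 0)))
```

[Driving it by hand: two elements for "user-1" arrive, then the timer fires and the operator emits ("user-1", 2) and drops the state. The point is not this particular logic but that, at this level, the state layout, the timing, and the output are entirely up to the author.]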
> > >>>>>> When you think at this level you realize you can program just about anything you want. You can create whatever fault-tolerant data structures you want and easily execute robust stateful computation over data streams at scale. This is the real technology and power of Flink, IMO.
> > >>>>>>
> > >>>>>> Also, at this level I don't have to think about the complexities of windowing semantics, learn as much API, etc. I can easily have some inputs that are broadcast and others that are keyed, manage my own state in whatever data structure makes sense, etc. If I know exactly what I actually want to do, I can just do it with the full power of my chosen language, data structures, etc. I'm not "restricted" to trying to map everything onto higher-level Flink constructs, which is sometimes actually more complicated.
> > >>>>>>
> > >>>>>> Programming at this level is actually fairly easy to do, but people seem a bit afraid of this level of the API. They think of it as low-level or custom hacking.
> > >>>>>>
> > >>>>>> Anyway, I guess my thought is this: should we explain Flink to people at this level *first*? Show that you have nearly unlimited power and flexibility to build what you want, *and only then* explain the higher-level APIs they can use *if* those match their use cases well.
> > >>>>>>
> > >>>>>> Would this better demonstrate to people the power of Flink and maybe *liberate* them a bit from feeling they have to map their problem onto a more complex set of higher-level primitives?
> > >>>>>> I see people trying to shoe-horn what they are really trying to do, which is simple to explain in English, onto windows, triggers, CoFlatMaps, etc., and this gets complicated sometimes. It's like an impedance mismatch. You could often just solve the problem very easily in straight Java/Scala.
> > >>>>>>
> > >>>>>> Anyway, it's very easy to drop down a level in the API and program whatever you want, but users don't seem to *perceive* it that way.
> > >>>>>>
> > >>>>>> Just some thoughts... Any feedback? Have any of you had similar experiences when working with newer Flink users, or as a newer Flink user yourself? Can/should we do anything to make the *lower-level* API more accessible/visible to users?
> > >>>>>>
> > >>>>>> -Jamie
> > >>
> > >> --
> > >>
> > >> Jamie Grier
> > >> data Artisans, Director of Applications Engineering
> > >> @jamiegrier <https://twitter.com/jamiegrier>
> > >> ja...@data-artisans.com