Hi Roberto,

flink-jpmml looks quite promising and could be a first step towards the
model serving story. Thus, I'm really looking forward to seeing it being
open sourced by you guys :-)

@Katherin, I'm not saying that there is no interest in the community in
working on batch features. However, there is simply not much capacity left to
mentor such an effort at the moment. I fear that without mentoring from an
experienced contributor who has worked on the batch part, it will be
extremely hard to get such a change into the code base. But this will
hopefully change in the future.

I think the discussion from this thread has moved over to [1], and I will
continue the discussion there.

[1]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Machine-Learning-on-Flink-Next-steps-td16334.html#none

Cheers,
Till

On Wed, Mar 8, 2017 at 1:59 AM, Kavulya, Soila P <soila.p.kavu...@intel.com>
wrote:

> Hi Theodore,
>
> We had put together a proposal for an ML DSL in Apache Beam. We had
> developed a couple of scoring engines as part of TAP,
> https://github.com/tapanalyticstoolkit/model-scoring-java and
> https://github.com/tapanalyticstoolkit/scoring-pipelines. However, our
> group is no longer actively developing them.
>
> Thanks,
>
> Soila
>
> From: Theodore Vasiloudis [mailto:theodoros.vasilou...@gmail.com]
> Sent: Friday, March 3, 2017 4:11 AM
> To: dev@flink.apache.org
> Cc: Kavulya, Soila P <soila.p.kavu...@intel.com>
> Subject: Re: [DISCUSS] Flink ML roadmap
>
> It seems like a relatively new project, backed by Intel.
>
> My impression from the doc Roberto linked is that they might switch to
> using Beam instead of Spark (?)
> I'm cc'ing Soila, who is a developer of TAP and has worked on FlinkML in
> the past; perhaps she has some input on how they plan to work with
> streaming and ML in TAP.
>
> Repos:
> [1] https://github.com/tapanalyticstoolkit/
>
> On Fri, Mar 3, 2017 at 12:24 PM, Stavros Kontopoulos <
> st.kontopou...@gmail.com> wrote:
> Interesting, thanks @Roberto. I see that only the TAP Analytics Toolkit
> supports streaming. I am not aware of its market share, anyone?
>
> Best,
> Stavros
>
> On Fri, Mar 3, 2017 at 11:50 AM, Theodore Vasiloudis <
> theodoros.vasilou...@gmail.com>
> wrote:
>
> > Thank you for the links Roberto, I did not know that Beam was working on
> > an ML abstraction as well. I'm sure we can learn from that.
> >
> > I'll start another thread today where we can discuss next steps and
> > action points now that we have a few different paths to follow listed on
> > the shared doc, since our deadline was today. We welcome further
> > discussions of course.
> >
> > Regards,
> > Theodore
> >
> > On Thu, Mar 2, 2017 at 10:52 AM, Roberto Bentivoglio <
> > roberto.bentivog...@radicalbit.io> wrote:
> >
> > > Hi All,
> > >
> > > First of all I'd like to introduce myself: my name is Roberto Bentivoglio
> > > and I'm currently working for Radicalbit, like Andrea Spina (who already
> > > wrote on this thread).
> > > I haven't had the chance to contribute directly to Flink so far, but some
> > > colleagues of mine have been doing so for at least a year (they have also
> > > contributed to the machine learning library).
> > >
> > > I hope I'm not jumping into the discussion too late; it's really
> > > interesting, and the analysis document depicts the currently available
> > > scenarios really well. Many thanks for your effort!
> > >
> > > If I can add my two cents to the discussion, I'd like to add the following:
> > >  - it's clear that the Flink community is currently more focused on
> > > streaming features than on batch features. For this reason I think that
> > > implementing "Offline learning with Streaming API" is really a great idea.
> > >  - I think that the "Online learning" option is really a good fit for
> > > Flink, but maybe we could initially give a higher priority to the "Offline
> > > learning with Streaming API" option. However, I think the online option
> > > will be the main goal for the mid/long term.
> > >  - we implemented a library based on jpmml-evaluator [1] and Flink called
> > > "flink-jpmml". Using this library you can train models on external
> > > systems and, after you've exported them in the PMML standard format, use
> > > them to run evaluations on top of the DataStream API (a rough sketch of
> > > the intended usage follows below this list). We haven't open sourced this
> > > library yet, but we're planning to do so in the next weeks; we'd like to
> > > complete the documentation and the final code reviews before sharing it.
> > > I hope it will be helpful for the community to enhance the ML support in
> > > Flink.
> > >  - I'd also like to mention that the Apache Beam community is thinking
> > > about an ML DSL. There is a design document and a couple of Jira tasks
> > > for that [2][3].
> > >
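> > > To make this more concrete, here is a rough sketch of how such an exported
> > > PMML model could be scored from within a DataStream job. This is only an
> > > illustration, not our actual flink-jpmml code, and the jpmml-evaluator and
> > > Flink class/method names are written from memory, so please treat them as
> > > assumptions:
> > >
> > >   import java.io.FileInputStream
> > >   import scala.collection.JavaConverters._
> > >
> > >   import org.apache.flink.api.common.functions.RichMapFunction
> > >   import org.apache.flink.configuration.Configuration
> > >
> > >   import org.jpmml.evaluator.{Evaluator, ModelEvaluatorFactory}
> > >   import org.jpmml.model.PMMLUtil
> > >
> > >   // Scores each incoming feature map against a PMML model that was trained
> > >   // and exported by an external system. The model is loaded once per task.
> > >   class PmmlScorer(pmmlPath: String)
> > >       extends RichMapFunction[Map[String, Any], Map[String, Any]] {
> > >
> > >     @transient private var evaluator: Evaluator = _
> > >
> > >     override def open(parameters: Configuration): Unit = {
> > >       val pmml = PMMLUtil.unmarshal(new FileInputStream(pmmlPath))
> > >       evaluator = ModelEvaluatorFactory.newInstance().newModelEvaluator(pmml)
> > >     }
> > >
> > >     override def map(features: Map[String, Any]): Map[String, Any] = {
> > >       // Prepare the raw values for every input field declared by the model.
> > >       val arguments = evaluator.getInputFields.asScala.map { field =>
> > >         field.getName -> field.prepare(
> > >           features.getOrElse(field.getName.getValue, null))
> > >       }.toMap
> > >
> > >       // Evaluate and flatten the result back into a plain Scala map.
> > >       evaluator.evaluate(arguments.asJava).asScala.map {
> > >         case (name, value) => name.getValue -> (value: Any)
> > >       }.toMap
> > >     }
> > >   }
> > >
> > >   // Usage: featureStream.map(new PmmlScorer("/path/to/model.pmml"))
> > >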
> > > At Radicalbit we're really keen to focus our effort on improving the ML
> > > support in Flink, and our team will for sure contribute to this effort on
> > > a regular basis.
> > >
> > > Looking forward to working with you!
> > >
> > > Many thanks,
> > > Roberto
> > >
> > > [1] - https://github.com/jpmml/jpmml-evaluator
> > > [2] - https://docs.google.com/document/d/17cRZk_yqHm3C0fljivjN66MbLkeKS1yjo4PBECHb-xA
> > > [3] - https://issues.apache.org/jira/browse/BEAM-303
> > >
> > > On 28 February 2017 at 19:35, Gábor Hermann <m...@gaborhermann.com> wrote:
> > >
> > > > Hi Philipp,
> > > >
> > > > It's great to hear you are interested in Flink ML!
> > > >
> > > > Based on your description, your prototype seems like an interesting
> > > > approach for combining online+offline learning. If you're interested, we
> > > > might find a way to integrate your work, or at least your ideas, into
> > > > Flink ML if we decide on a direction that fits your approach. I think
> > > > your work could be relevant for almost all the directions listed there
> > > > (if I understand correctly you'd even like to serve predictions on
> > > > unlabeled data).
> > > >
> > > > Feel free to join the discussion in the docs you've mentioned :)
> > > >
> > > > Cheers,
> > > > Gabor
> > > >
> > > >
> > > > On 2017-02-27 18:39, Philipp Zehnder wrote:
> > > >
> > > >> Hello all,
> > > >>
> > > >> I’m new to this mailing list and I wanted to introduce myself. My name
> > > >> is Philipp Zehnder and I’m a master’s student in Computer Science at
> > > >> the Karlsruhe Institute of Technology in Germany, currently writing my
> > > >> master’s thesis, whose main goal is to integrate reusable machine
> > > >> learning components into a stream processing network. One part of my
> > > >> thesis is to create an API for distributed online machine learning.
> > > >>
> > > >> I saw that there are some recent discussions about how to continue the
> > > >> development of Flink ML [1] and I want to share some of my experiences
> > > >> and maybe get some feedback from the community on my ideas.
> > > >>
> > > >> As I am new to open source projects, I hope this is the right place for
> > > >> this.
> > > >>
> > > >> In the beginning, I had a look at different already existing
> > > >> frameworks, like Apache SAMOA for example, which is great and has a lot
> > > >> of useful resources. However, as Flink is currently focusing on
> > > >> streaming, from my point of view it makes sense to also have a
> > > >> streaming machine learning API as part of the Flink ecosystem.
> > > >>
> > > >> I’m currently working on building a prototype for a distributed
> > > >> streaming machine learning library based on Flink that can be used for
> > > >> online and “classical” offline learning.
> > > >>
> > > >> The machine learning algorithm takes labeled and unlabeled data. On a
> > > >> labeled data point, first a prediction is performed and then the label
> > > >> is used to train the model. On an unlabeled data point, just a
> > > >> prediction is performed. The main difference between the online and
> > > >> offline algorithms is that in the offline case the labeled data must be
> > > >> handed to the model before the unlabeled data. In the online case, it
> > > >> is still possible to process labeled data at a later point to update
> > > >> the model. The advantage of this approach is that batch algorithms can
> > > >> be applied to streaming data and online algorithms can be supported as
> > > >> well (a rough sketch of such an operator follows below).
> > > >>
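> > > >> To illustrate the idea (this is just a sketch, not my actual prototype
> > > >> code, and all names are made up for the example), such an operator
> > > >> could look roughly like this with the streaming API:
> > > >>
> > > >>   import org.apache.flink.api.common.functions.RichFlatMapFunction
> > > >>   import org.apache.flink.util.Collector
> > > >>
> > > >>   // A data point with an optional label; unlabeled points only get a
> > > >>   // prediction, labeled points additionally update the model.
> > > >>   case class Sample(features: Array[Double], label: Option[Double])
> > > >>
> > > >>   class PredictThenTrain(learningRate: Double)
> > > >>       extends RichFlatMapFunction[Sample, Double] {
> > > >>
> > > >>     // Per-parallel-instance linear model; a real implementation would
> > > >>     // checkpoint this state.
> > > >>     private var weights: Array[Double] = _
> > > >>
> > > >>     override def flatMap(sample: Sample, out: Collector[Double]): Unit = {
> > > >>       if (weights == null) weights = Array.fill(sample.features.length)(0.0)
> > > >>
> > > >>       // Always emit a prediction, whether the point is labeled or not.
> > > >>       val prediction =
> > > >>         weights.zip(sample.features).map { case (w, x) => w * x }.sum
> > > >>       out.collect(prediction)
> > > >>
> > > >>       // If a label is available, take one SGD step on the squared error.
> > > >>       sample.label.foreach { y =>
> > > >>         val error = prediction - y
> > > >>         weights = weights.zip(sample.features).map {
> > > >>           case (w, x) => w - learningRate * error * x
> > > >>         }
> > > >>       }
> > > >>     }
> > > >>   }
> > > >>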
> > > >> One difference to batch learning lies in the transformers that are
> > > >> used to preprocess the data. For example, a simple mean subtraction
> > > >> must be implemented with a rolling mean, because we can’t calculate the
> > > >> mean over all the data, but the Flink Streaming API is perfect for
> > > >> that. It would be useful for users to have an extensible toolbox of
> > > >> transformers.
> > > >>
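> > > >> For example, a streaming mean-subtraction transformer could keep a
> > > >> running mean per parallel instance (again only an illustrative sketch,
> > > >> not prototype code):
> > > >>
> > > >>   import org.apache.flink.api.common.functions.RichMapFunction
> > > >>
> > > >>   class RunningMeanSubtraction extends RichMapFunction[Double, Double] {
> > > >>     private var count: Long = 0L
> > > >>     private var mean: Double = 0.0
> > > >>
> > > >>     override def map(value: Double): Double = {
> > > >>       // Incrementally update the running mean, then center the current
> > > >>       // value on it.
> > > >>       count += 1
> > > >>       mean += (value - mean) / count
> > > >>       value - mean
> > > >>     }
> > > >>   }
> > > >>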
> > > >> Another difference is the evaluation of the models. We don’t have a
> > > >> single value to determine the model quality; in streaming scenarios
> > > >> this value evolves over time as the model sees more labeled data.
> > > >>
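> > > >> As a sketch of what I mean (illustrative only, invented names), the
> > > >> quality metric is itself a stream that is updated with every labeled
> > > >> element:
> > > >>
> > > >>   import org.apache.flink.streaming.api.scala._
> > > >>
> > > >>   case class Scored(prediction: Double, label: Double)
> > > >>
> > > >>   object StreamingEvaluation {
> > > >>     // Emits the mean squared error over everything seen so far after
> > > >>     // each labeled element, so the metric evolves as more labels arrive.
> > > >>     def runningMse(scored: DataStream[Scored]): DataStream[Double] =
> > > >>       scored
> > > >>         .map(s => ((s.prediction - s.label) * (s.prediction - s.label), 1L))
> > > >>         .keyBy(_ => 0) // a single global metric, just for the example
> > > >>         .reduce((a, b) => (a._1 + b._1, a._2 + b._2))
> > > >>         .map(t => t._1 / t._2)
> > > >>   }
> > > >>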
> > > >> However, the transformation and evaluation again work similarly in
> > > >> both online learning and offline learning.
> > > >>
> > > >> I also liked the discussion in [2] and I think that the competition in
> > > >> the batch learning field is hard and there are already a lot of great
> > > >> projects. I think it is true that in most real-world problems it is not
> > > >> necessary to update the model immediately, but there are a lot of use
> > > >> cases for machine learning on streams. For them it would be nice to
> > > >> have a native approach.
> > > >>
> > > >> A streaming machine learning API for Flink would fit very well and I
> > > >> would also be willing to contribute to the future development of the
> > > >> Flink ML library.
> > > >>
> > > >>
> > > >>
> > > >> Best regards,
> > > >>
> > > >> Philipp
> > > >>
> > > >> [1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Flink-ML-roadmap-td16040.html
> > > >> [2] https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3Ud06MIRhahtJ6dw/edit#heading=h.v9v1aj3xosv2
> > > >>
> > > >>
> > > >> On 23.02.2017 at 15:48, Gábor Hermann <m...@gaborhermann.com> wrote:
> > > >>>
> > > >>> Okay, I've created a skeleton of the design doc for choosing a
> > > >>> direction:
> > > >>> https://docs.google.com/document/d/1afQbvZBTV15qF3vobVWUjxQc49h3Ud06MIRhahtJ6dw/edit?usp=sharing
> > > >>>
> > > >>> Much of the pros/cons have already been discussed here, so I'll try to
> > > >>> put there all the arguments mentioned in this thread. Feel free to put
> > > >>> there more :)
> > > >>>
> > > >>> @Stavros: I agree we should take action fast. What about collecting
> > > >>> our thoughts in the doc by around Tuesday next week (28. February)?
> > > >>> Then decide on the direction and design a roadmap by around Friday
> > > >>> (3. March)? Is that feasible, or should it take more time?
> > > >>>
> > > >>> I think it will be necessary to have a shepherd, or even better a
> > > >>> committer, to be involved in at least reviewing and accepting the
> > > >>> roadmap. It would be best if a committer coordinated all this.
> > > >>> @Theodore: Would you like to do the coordination?
> > > >>>
> > > >>> Regarding the use-cases: I've seen some abstracts of talks at SF Flink
> > > >>> Forward [1] that seem promising. There are companies already using
> > > >>> Flink for ML [2,3,4,5].
> > > >>>
> > > >>> [1] http://sf.flink-forward.org/program/sessions/
> > > >>> [2] http://sf.flink-forward.org/kb_sessions/experiences-with-streaming-vs-micro-batch-for-online-learning/
> > > >>> [3] http://sf.flink-forward.org/kb_sessions/introducing-flink-tensorflow/
> > > >>> [4] http://sf.flink-forward.org/kb_sessions/non-flink-machine-learning-on-flink/
> > > >>> [5] http://sf.flink-forward.org/kb_sessions/streaming-deep-learning-scenarios-with-flink/
> > > >>>
> > > >>> Cheers,
> > > >>> Gabor
> > > >>>
> > > >>>
> > > >>> On 2017-02-23 15:19, Katherin Eri wrote:
> > > >>>
> > > >>>> I have already asked some teams for useful cases, but all of them
> > > >>>> need time to think.
> > > >>>> During the analysis something will finally arise.
> > > >>>> Maybe we can ask partners of Flink for cases? Data Artisans got the
> > > >>>> results of a customer survey [1]; better ML support is wanted, so we
> > > >>>> could ask what exactly is necessary.
> > > >>>>
> > > >>>> [1] http://data-artisans.com/flink-user-survey-2016-part-2/
> > > >>>>
> > > >>>> On 23 Feb 2017 at 4:32 PM, "Stavros Kontopoulos" <
> > > >>>> st.kontopou...@gmail.com> wrote:
> > > >>>>
> > > >>>>> +100 for a design doc.
> > > >>>>>
> > > >>>>> Could we also set a roadmap after some time-boxed investigation
> > > >>>>> captured in that document? We need action.
> > > >>>>>
> > > >>>>> Looking forward to working on this (whatever that might be) ;) Also,
> > > >>>>> are there any data supporting one direction or the other from a
> > > >>>>> customer perspective? It would help to make more informed decisions.
> > > >>>>>
> > > >>>>> On Thu, Feb 23, 2017 at 2:23 PM, Katherin Eri <
> > > katherinm...@gmail.com>
> > > >>>>> wrote:
> > > >>>>>
> > > >>>>>> Yes, ok.
> > > >>>>>> Let's start some design document and write down there the already
> > > >>>>>> mentioned ideas about the parameter server, Clipper and others. It
> > > >>>>>> would be nice if we also mapped these approaches to cases.
> > > >>>>>> We will work on it collaboratively on each topic; maybe finally we
> > > >>>>>> will form some picture that could be agreed with the committers.
> > > >>>>>> @Gabor, could you please start such a shared doc, as you have
> > > >>>>>> already proposed several ideas?
> > > >>>>>>
> > > >>>>>> Thu, 23 Feb 2017, 15:06 Gábor Hermann <m...@gaborhermann.com>:
> > > >>>>>>
> > > >>>>>>> I agree that it's better to go in one direction first, but I think
> > > >>>>>>> online and offline with the streaming API can go somewhat parallel
> > > >>>>>>> later. We could set a short-term goal, concentrate initially on one
> > > >>>>>>> direction, and showcase that direction (e.g. in a blogpost). But
> > > >>>>>>> first, we should list the pros/cons in a design doc as a minimum.
> > > >>>>>>> Then make a decision what direction to go. Would that be feasible?
> > > >>>>>>>
> > > >>>>>>> On 2017-02-23 12:34, Katherin Eri wrote:
> > > >>>>>>>
> > > >>>>>>>> I'm not sure that this is feasible; doing everything at the same
> > > >>>>>>>> time could mean doing nothing((((
> > > >>>>>>>> I'm just afraid that the words "we will work on streaming, not on
> > > >>>>>>>> batching, we have no committer's time for this" mean that yes, we
> > > >>>>>>>> started work on FLINK-1730, but nobody will commit this work in
> > > >>>>>>>> the end, as already happened with this ticket.
> > > >>>>>>>>
> > > >>>>>>>> On 23 Feb 2017 at 14:26, "Gábor Hermann" <m...@gaborhermann.com>
> > > >>>>>>>> wrote:
> > > >>>>>>>>
> > > >>>>>>>>> @Theodore: Great to hear you think the "batch on streaming"
> > > >>>>>>>>> approach is possible! Of course, we need to pay attention to all
> > > >>>>>>>>> the pitfalls there, if we go that way.
> > > >>>>>>>>>
> > > >>>>>>>>> +1 for a design doc!
> > > >>>>>>>>>
> > > >>>>>>>>> I would add that it's possible to make efforts in all three
> > > >>>>>>>>> directions (i.e. batch, online, batch on streaming) at the same
> > > >>>>>>>>> time. Although, it might be worth concentrating on one. E.g. it
> > > >>>>>>>>> would not be so useful to have the same batch algorithms with
> > > >>>>>>>>> both the batch API and the streaming API. We can decide later.
> > > >>>>>>>>>
> > > >>>>>>>>> The design doc could be partitioned into these 3 directions, and
> > > >>>>>>>>> we can collect there the pros/cons too. What do you think?
> > > >>>>>>>>>
> > > >>>>>>>>> Cheers,
> > > >>>>>>>>> Gabor
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>>>> On 2017-02-23 12:13, Theodore Vasiloudis wrote:
> > > >>>>>>>>>
> > > >>>>>>>>>> Hello all,
> > > >>>>>>>>>>
> > > >>>>>>>>>> @Gabor, we have discussed the idea of using the streaming API to
> > > >>>>>>>>>> write all of our ML algorithms with a couple of people offline,
> > > >>>>>>>>>> and I think it might be possible and is generally worth a shot.
> > > >>>>>>>>>> The approach we would take would be close to Vowpal Wabbit, not
> > > >>>>>>>>>> exactly "online", but rather "fast-batch".
> > > >>>>>>>>>>
> > > >>>>>>>>>> There will be problems popping up again, even for very simple
> > > >>>>>>>>>> algos like online linear regression with SGD [1], but hopefully
> > > >>>>>>>>>> fixing those will be more aligned with the priorities of the
> > > >>>>>>>>>> community.
> > > >>>>>>>>>>
> > > >>>>>>>>>> @Katherin, my understanding is that given the limited resources,
> > > >>>>>>>>>> there is no development effort focused on batch processing right
> > > >>>>>>>>>> now.
> > > >>>>>>>>>>
> > > >>>>>>>>>> So to summarize, it seems like there are people willing to work
> > > >>>>>>>>>> on ML on Flink, but nobody is sure how to do it.
> > > >>>>>>>>>> There are many directions we could take (batch, online, batch on
> > > >>>>>>>>>> streaming), each with its own merits and downsides.
> > > >>>>>>>>>>
> > > >>>>>>>>>> If you want we can start a design doc and move the conversation
> > > >>>>>>>>>> there, come up with a roadmap and start implementing.
> > > >>>>>>>>>>
> > > >>>>>>>>>> Regards,
> > > >>>>>>>>>> Theodore
> > > >>>>>>>>>>
> > > >>>>>>>>>> [1]
> > > >>>>>>>>>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Understanding-connected-streams-use-without-timestamps-td10241.html
> > > >>>>>>>>>>
> > > >>>>>>>>>> On Tue, Feb 21, 2017 at 11:17 PM, Gábor Hermann <
> > > >>>>>>>>>> m...@gaborhermann.com> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>> It's great to see so much activity in this discussion :)
> > > >>>>>>>>>>
> > > >>>>>>>>>>> I'll try to add my thoughts.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> I think building a developer community (Till's 2. point)
> can
> > be
> > > >>>>>>>>>>>
> > > >>>>>>>>>> slightly
> > > >>>>>>>
> > > >>>>>>>> separated from what features we should aim for (1. point) and
> > > >>>>>>>>>>>
> > > >>>>>>>>>> showcasing
> > > >>>>>>>
> > > >>>>>>>> (3. point). Thanks Till for bringing up the ideas for
> > > >>>>>>>>>>>
> > > >>>>>>>>>> restructuring,
> > > >>>>>
> > > >>>>>> I'm
> > > >>>>>>>
> > > >>>>>>>> sure we'll find a way to make the development process more
> > > >>>>>>>>>>>
> > > >>>>>>>>>> dynamic.
> > > >>>>>
> > > >>>>>> I'll
> > > >>>>>>>
> > > >>>>>>>> try to address the rest here.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> It's hard to choose directions between streaming and batch
> > ML.
> > > As
> > > >>>>>>>>>>>
> > > >>>>>>>>>> Theo
> > > >>>>>>
> > > >>>>>>> has
> > > >>>>>>>>>>> indicated, not much online ML is used in production, but
> > Flink
> > > >>>>>>>>>>> concentrates
> > > >>>>>>>>>>> on streaming, so online ML would be a better fit for Flink.
> > > >>>>>>>>>>>
> > > >>>>>>>>>> However,
> > > >>>>>
> > > >>>>>> as
> > > >>>>>>>
> > > >>>>>>>> most of you argued, there's definite need for batch ML. But
> > batch
> > > >>>>>>>>>>>
> > > >>>>>>>>>> ML
> > > >>>>>
> > > >>>>>> seems
> > > >>>>>>>>>>> hard to achieve because there are blocking issues with
> > > >>>>>>>>>>> persisting,
> > > >>>>>>>>>>> iteration paths etc. So it's no good either way.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> I propose a seemingly crazy solution: what if we developed
> > > batch
> > > >>>>>>>>>>> algorithms also with the streaming API? The batch API would
> > > >>>>>>>>>>>
> > > >>>>>>>>>> clearly
> > > >>>>>
> > > >>>>>> seem
> > > >>>>>>>
> > > >>>>>>>> more suitable for ML algorithms, but there a lot of benefits
> of
> > > >>>>>>>>>>>
> > > >>>>>>>>>> this
> > > >>>>>
> > > >>>>>> approach too, so it's clearly worth considering. Flink also has
> > > >>>>>>>>>>>
> > > >>>>>>>>>> the
> > > >>>>>
> > > >>>>>> high
> > > >>>>>>>
> > > >>>>>>>> level vision of "streaming for everything" that would clearly
> > fit
> > > >>>>>>>>>>>
> > > >>>>>>>>>> this
> > > >>>>>>
> > > >>>>>>> case. What do you all think about this? Do you think this
> > solution
> > > >>>>>>>>>>>
> > > >>>>>>>>>> would
> > > >>>>>>>
> > > >>>>>>>> be
> > > >>>>>>>>>>> feasible? I would be happy to make a more elaborate
> proposal,
> > > but
> > > >>>>>>>>>>>
> > > >>>>>>>>>> I
> > > >>>>>
> > > >>>>>> push
> > > >>>>>>>
> > > >>>>>>>> my
> > > >>>>>>>>>>> main ideas here:
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> 1) Simplifying by using one system
> > > >>>>>>>>>>> It could simplify the work of both the users and the
> > > developers.
> > > >>>>>>>>>>>
> > > >>>>>>>>>> One
> > > >>>>>
> > > >>>>>> could
> > > >>>>>>>>>>> execute training once, or could execute it periodically
> e.g.
> > by
> > > >>>>>>>>>>>
> > > >>>>>>>>>> using
> > > >>>>>>
> > > >>>>>>> windows. Low-latency serving and training could be done in the
> > > >>>>>>>>>>>
> > > >>>>>>>>>> same
> > > >>>>>
> > > >>>>>> system.
> > > >>>>>>>>>>> We could implement incremental algorithms, without any side
> > > >>>>>>>>>>> inputs
> > > >>>>>>>>>>>
> > > >>>>>>>>>> for
> > > >>>>>>
> > > >>>>>>> combining online learning (or predictions) with batch learning.
> > Of
> > > >>>>>>>>>>> course,
> > > >>>>>>>>>>> all the logic describing these must be somehow implemented
> > > (e.g.
> > > >>>>>>>>>>> synchronizing predictions with training), but it should be
> > > easier
> > > >>>>>>>>>>>
> > > >>>>>>>>>> to
> > > >>>>>
> > > >>>>>> do
> > > >>>>>>>
> > > >>>>>>>> so
> > > >>>>>>>>>>> in one system, than by combining e.g. the batch and
> streaming
> > > >>>>>>>>>>> API.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> 2) Batch ML with the streaming API is not harder
> > > >>>>>>>>>>> Despite these benefits, it could seem harder to implement
> > batch
> > > >>>>>>>>>>> ML
> > > >>>>>>>>>>>
> > > >>>>>>>>>> with
> > > >>>>>>>
> > > >>>>>>>> the streaming API, but in my opinion it's not. There are more
> > > >>>>>>>>>>>
> > > >>>>>>>>>> flexible,
> > > >>>>>>>
> > > >>>>>>>> lower-level optimization potentials with the streaming API.
> Most
> > > >>>>>>>>>>> distributed ML algorithms use a lower-level model than the
> > > batch
> > > >>>>>>>>>>>
> > > >>>>>>>>>> API
> > > >>>>>
> > > >>>>>> anyway, so sometimes it feels like forcing the algorithm logic
> > > >>>>>>>>>>>
> > > >>>>>>>>>> into
> > > >>>>>
> > > >>>>>> the
> > > >>>>>>>
> > > >>>>>>>>>>> training API and tweaking it. Although we could not use the
> > > >>>>>>>>>>> batch primitives like join, we would have the lower-level
> > > >>>>>>>>>>> flexibility. E.g. in my experience with implementing a
> > > >>>>>>>>>>> distributed matrix factorization algorithm [1], I
> > > >>>>>>>>>>>
> > > >>>>>>>>>> couldn't
> > > >>>>>>>
> > > >>>>>>>> do a simple optimization because of the limitations of the
> > > >>>>>>>>>>>
> > > >>>>>>>>>> iteration
> > > >>>>>
> > > >>>>>> API
> > > >>>>>>>
> > > >>>>>>>> [2]. Even if we pushed all the development effort to make the
> > > >>>>>>>>>>>
> > > >>>>>>>>>> batch
> > > >>>>>
> > > >>>>>> API
> > > >>>>>>>
> > > >>>>>>>> more suitable for ML there would be things we couldn't do.
> E.g.
> > > >>>>>>>>>>>
> > > >>>>>>>>>> there
> > > >>>>>>
> > > >>>>>>> are
> > > >>>>>>>
> > > >>>>>>>> approaches for updating a model iteratively without locks
> [3,4]
> > > >>>>>>>>>>>
> > > >>>>>>>>>> (i.e.
> > > >>>>>>
> > > >>>>>>> somewhat asynchronously), and I don't see a clear way to
> > implement
> > > >>>>>>>>>>>
> > > >>>>>>>>>> such
> > > >>>>>>>
> > > >>>>>>>> algorithms with the batch API.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> 3) Streaming community (users and devs) benefit
> > > >>>>>>>>>>> The Flink streaming community in general would also benefit
> > > from
> > > >>>>>>>>>>>
> > > >>>>>>>>>> this
> > > >>>>>>
> > > >>>>>>> direction. There are many features needed in the streaming API
> > for
> > > >>>>>>>>>>>
> > > >>>>>>>>>> ML
> > > >>>>>>
> > > >>>>>>> to
> > > >>>>>>>
> > > >>>>>>>> work, but this is also true for the batch API. One really
> > > >>>>>>>>>>>
> > > >>>>>>>>>> important
> > > >>>>>
> > > >>>>>> is
> > > >>>>>>
> > > >>>>>>> the
> > > >>>>>>>>>>> loops API (a.k.a. iterative DataStreams) [5]. There has
> been
> > a
> > > >>>>>>>>>>> lot
> > > >>>>>>>>>>>
> > > >>>>>>>>>> of
> > > >>>>>>
> > > >>>>>>> effort (mostly from Paris) for making it mature enough [6].
> Kate
> > > >>>>>>>>>>> mentioned
> > > >>>>>>>>>>> using GPUs, and I'm sure they have uses in streaming
> > generally
> > > >>>>>>>>>>>
> > > >>>>>>>>>> [7].
> > > >>>>>
> > > >>>>>> Thus,
> > > >>>>>>>
> > > >>>>>>>> by improving the streaming API to allow ML algorithms, the
> > > >>>>>>>>>>>
> > > >>>>>>>>>> streaming
> > > >>>>>
> > > >>>>>> API
> > > >>>>>>>
> > > >>>>>>>> benefit too (which is important as they have a lot more
> > production
> > > >>>>>>>>>>>
> > > >>>>>>>>>> users
> > > >>>>>>>
> > > >>>>>>>> than the batch API).
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> 4) Performance can be at least as good
> > > >>>>>>>>>>> I believe the same performance could be achieved with the
> > > >>>>>>>>>>>
> > > >>>>>>>>>> streaming
> > > >>>>>
> > > >>>>>> API
> > > >>>>>>>
> > > >>>>>>>> as
> > > >>>>>>>>>>> with the batch API. Streaming API is much closer to the
> > runtime
> > > >>>>>>>>>>>
> > > >>>>>>>>>> than
> > > >>>>>
> > > >>>>>> the
> > > >>>>>>>
> > > >>>>>>>> batch API. For corner-cases, with runtime-layer optimizations
> of
> > > >>>>>>>>>>>
> > > >>>>>>>>>> batch
> > > >>>>>>
> > > >>>>>>> API,
> > > >>>>>>>>>>> we could find a way to do the same (or similar)
> optimization
> > > for
> > > >>>>>>>>>>>
> > > >>>>>>>>>> the
> > > >>>>>
> > > >>>>>> streaming API (see my previous point). Such case could be using
> > > >>>>>>>>>>>
> > > >>>>>>>>>> managed
> > > >>>>>>>
> > > >>>>>>>> memory (and spilling to disk). There are also benefits by
> > default,
> > > >>>>>>>>>>>
> > > >>>>>>>>>> e.g.
> > > >>>>>>>
> > > >>>>>>>> we
> > > >>>>>>>>>>> would have a finer grained fault tolerance with the
> streaming
> > > >>>>>>>>>>> API.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> 5) We could keep batch ML API
> > > >>>>>>>>>>> For the shorter term, we should not throw away all the
> > > algorithms
> > > >>>>>>>>>>> implemented with the batch API. By pushing forward the
> > > >>>>>>>>>>> development
> > > >>>>>>>>>>>
> > > >>>>>>>>>> with
> > > >>>>>>>
> > > >>>>>>>> side inputs we could make them usable with streaming API.
> Then,
> > if
> > > >>>>>>>>>>>
> > > >>>>>>>>>> the
> > > >>>>>>
> > > >>>>>>> library gains some popularity, we could replace the algorithms
> in
> > > >>>>>>>>>>>
> > > >>>>>>>>>> the
> > > >>>>>>
> > > >>>>>>> batch
> > > >>>>>>>>>>> API with streaming ones, to avoid the performance costs of
> > e.g.
> > > >>>>>>>>>>>
> > > >>>>>>>>>> not
> > > >>>>>
> > > >>>>>> being
> > > >>>>>>>
> > > >>>>>>>> able to persist.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> 6) General tools for implementing ML algorithms
> > > >>>>>>>>>>> Besides implementing algorithms one by one, we could give
> > more
> > > >>>>>>>>>>>
> > > >>>>>>>>>> general
> > > >>>>>>
> > > >>>>>>> tools for making it easier to implement algorithms. E.g.
> > parameter
> > > >>>>>>>>>>>
> > > >>>>>>>>>> server
> > > >>>>>>>
> > > >>>>>>>> [8,9]. Theo also mentioned in another thread that TensorFlow
> > has a
> > > >>>>>>>>>>> similar
> > > >>>>>>>>>>> model to Flink streaming, we could look into that too. I
> > think
> > > >>>>>>>>>>>
> > > >>>>>>>>>> often
> > > >>>>>
> > > >>>>>> when
> > > >>>>>>>
> > > >>>>>>>> deploying a production ML system, much more configuration and
> > > >>>>>>>>>>>
> > > >>>>>>>>>> tweaking
> > > >>>>>>
> > > >>>>>>> should be done than e.g. Spark MLlib allows. Why not allow
> that?
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> 7) Showcasing
> > > >>>>>>>>>>> Showcasing this could be easier. We could say that we're
> > doing
> > > >>>>>>>>>>>
> > > >>>>>>>>>> batch
> > > >>>>>
> > > >>>>>> ML
> > > >>>>>>>
> > > >>>>>>>> with a streaming API. That's interesting in its own. IMHO this
> > > >>>>>>>>>>> integration
> > > >>>>>>>>>>> is also a more approachable way towards end-to-end ML.
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Thanks for reading so far :)
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> [1] https://github.com/apache/flink/pull/2819
> > > >>>>>>>>>>> [2] https://issues.apache.org/jira/browse/FLINK-2396
> > > >>>>>>>>>>> [3] https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf
> > > >>>>>>>>>>> [4] https://www.usenix.org/system/files/conference/hotos13/hotos13-final77.pdf
> > > >>>>>>>>>>> [5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-15+Scoped+Loops+and+Job+Termination
> > > >>>>>>>>>>> [6] https://github.com/apache/flink/pull/1668
> > > >>>>>>>>>>> [7] http://lsds.doc.ic.ac.uk/sites/default/files/saber-sigmod16.pdf
> > > >>>>>>>>>>> [8] https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
> > > >>>>>>>>>>> [9] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Using-QueryableState-inside-Flink-jobs-and-Parameter-Server-implementation-td15880.html
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> Cheers,
> > > >>>>>>>>>>> Gabor
> > > >>>>>>>>>>>
> > > >>>>>>>>>>>
> > > >>>>>>>>>>> --
> > > >>>>>>>
> > > >>>>>> *Yours faithfully, *
> > > >>>>>>
> > > >>>>>> *Kate Eri.*
> > > >>>>>>
> > > >>>>>>
> > > >>
> > > >
> > >
> > >
> > > --
> > > Roberto Bentivoglio
> > > CTO
> > > e. roberto.bentivog...@radicalbit.io
> > > Radicalbit S.r.l.
> > > radicalbit.io
> > >
> >
>
>
