I think Flink ML could be a success. Many use cases out there could benefit
from such algorithms, especially online ones.
I agree examples should be created showing how it could be used.
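To make the "online" idea concrete, here is a minimal, hypothetical sketch (plain Java, not the FlinkML API; all names are illustrative) of an online SGD update for linear regression, where the model is updated one example at a time instead of over the full dataset:

```java
import java.util.Arrays;

// Hypothetical sketch of an online learner: the model is updated per
// incoming example, which is the kind of algorithm a streaming engine
// like Flink is naturally suited to. Names are illustrative only.
public class OnlineSgd {
    final double[] weights;
    private final double learningRate;

    public OnlineSgd(int dim, double learningRate) {
        this.weights = new double[dim];
        this.learningRate = learningRate;
    }

    // One SGD step for squared loss: w <- w - lr * (w.x - y) * x
    public void update(double[] x, double y) {
        double error = predict(x) - y;
        for (int i = 0; i < weights.length; i++) {
            weights[i] -= learningRate * error * x[i];
        }
    }

    public double predict(double[] x) {
        double dot = 0.0;
        for (int i = 0; i < x.length; i++) {
            dot += weights[i] * x[i];
        }
        return dot;
    }

    public static void main(String[] args) {
        OnlineSgd model = new OnlineSgd(2, 0.05);
        // Noiseless examples of the target y = 2*x1 + 1 (x[0] is a bias term),
        // fed one at a time as a stream would deliver them.
        double[][] xs = {{1, 1}, {1, 2}, {1, 3}, {1, 4}};
        for (int epoch = 0; epoch < 500; epoch++) {
            for (double[] x : xs) {
                model.update(x, 2 * x[1] + 1);
            }
        }
        // Weights converge to approximately [1.0, 2.0].
        System.out.println(Arrays.toString(model.weights));
    }
}
```

The point of the sketch is only that each `update` touches one example, so no pass over a full DataSet is needed per step.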

I was not aware of the project re-structuring issues. GPU support is really
important nowadays, but it is still not the major reason people are not
adopting Flink ML. Flink ML has to be developed further and promoted, as
previously stated.

In the meantime, as for the reviewing part, I am investing time there, so I
would like to see if we can join forces and push things forward.

I am aware of the evaluation framework PR and I will hopefully review it this
week. But can we commit to pushing anything, given the load people have?

As another option, could we propose someone to be a committer there as
well, someone Till could guide if needed?

I think we don't need to wait for all issues to be solved first. As for the
big picture, re-use makes sense, but I think the end result should be
something that benefits Flink. I would like to stay within Flink as much as
possible from a UX/features point of view. Of course people have already been
using a number of libraries for years; what we gain by implementing the
algorithms ourselves is getting those algorithms to work on large datasets
and on streams, while keeping the UX familiar at the same time.
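As a rough illustration of what a "familiar UX" could mean, here is a hypothetical scikit-learn-style estimator contract sketched in plain Java. This is not the actual FlinkML pipeline API; every name here is illustrative:

```java
// Hypothetical sketch of a scikit-learn-style fit/predict contract, to
// illustrate a "familiar UX". FlinkML's actual pipeline API may differ;
// all names here are illustrative only.
interface Estimator<M> {
    M fit(double[][] features, double[] labels); // train and return a model
}

interface Model {
    double predict(double[] features); // score a single example
}

// A trivial estimator: its model predicts the mean of the training labels.
class MeanEstimator implements Estimator<Model> {
    @Override
    public Model fit(double[][] features, double[] labels) {
        double sum = 0.0;
        for (double y : labels) {
            sum += y;
        }
        final double mean = labels.length == 0 ? 0.0 : sum / labels.length;
        return x -> mean; // the "model" ignores features and returns the mean
    }
}

public class FamiliarUx {
    public static void main(String[] args) {
        Estimator<Model> estimator = new MeanEstimator();
        Model model = estimator.fit(
            new double[][]{{1}, {2}, {3}},
            new double[]{10, 20, 30});
        System.out.println(model.predict(new double[]{4})); // mean of labels
    }
}
```

The design choice being illustrated: a user who knows scikit-learn's `fit`/`predict` convention can pick up such an API immediately, regardless of whether the execution underneath is batch or streaming.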

I think connecting to external libraries should be done, if possible, for
things outside your own domain, like databases or distributed file systems.
Is ML a domain that belongs to a streaming engine? Use cases should drive
that, IMHO. Again, any implementation should be justified by user needs; if
there is no such need, there is no reason to implement anything.

Just some thoughts...


On Mon, Feb 20, 2017 at 3:39 PM, Timur Shenkao <t...@timshenkao.su> wrote:

> Hello guys,
>
> My two cents.
> All Flink presentations, articles, etc. articulate that Flink is for ETL
> and data ingestion; CEP at most.
> If you visit http://flink.apache.org/usecases.html, you'll see there aren't
> any explicit ML or Graph use cases there.
> It's also stated that Flink is suitable when "Data that is processed
> quickly".
> That's why people believe that Flink isn't for ML or don't even know that
> Flink has such algorithms.
> Then, folks decide: "I'd rather use good old Spark or scikit-learn than
> dive into Flink's internals & implement the algo myself."
>
> Sincerely yours, Timur
>
> On Mon, Feb 20, 2017 at 1:53 PM, Katherin Eri <katherinm...@gmail.com>
> wrote:
>
> > Hello guys,
> >
> >
> > Maybe we can focus our forces on some E2E scenario or showcase
> > for Flink as an ML-supporting engine as well, and in this way update
> > the roadmap?
> >
> >
> > This means we can take some real-life/production problem, like fraud
> > detection in some area, and try to solve this problem from the point of
> > view of data science.
> >
> > Starting from data preprocessing and preparation, and finishing with the
> > implementation/usage of some ML algorithm.
> >
> > Doing this we will understand which issues are showstoppers for the
> > implementation of such functionality. We will be able to understand
> > Flink's users better.
> >
> >
> > Maybe the community could share its ideas on which showcase would be the
> > most useful for Apache Flink, or maybe data Artisans could lead this?
> >
> > пн, 20 февр. 2017 г. в 15:28, Theodore Vasiloudis <
> > theodoros.vasilou...@gmail.com>:
> >
> > > Hello all,
> > >
> > > thank you for opening this discussion, Stavros; note that it's almost
> > > exactly 1 year since I last opened such a topic (linked by Gabor), and
> > > the comments there are still relevant.
> > >
> > > I think Gabor described the current state quite well: development in
> > > the libraries is hard without committers dedicated to each project, and
> > > as a result FlinkML and CEP have stalled.
> > >
> > > I think it's important to look at why development has stalled as well.
> > > As people have mentioned, there's a multitude of ML libraries out there,
> > > and my impression was that not many people were looking to use Flink for
> > > ML. Lately that seems to have changed (with some interest shown in the
> > > Flink survey as well).
> > >
> > > Gabor makes some good points about future directions for the library.
> > > Our initial goal [1] was to make a truly scalable, easy-to-use library
> > > within the Flink ecosystem, providing a set of "workhorse" algorithms
> > > sampled from what's actually being used in industry. We planned for a
> > > library that has few algorithms, but does them properly.
> > >
> > > If we decide to go the way of focusing within Flink, we face some major
> > > challenges, because these are system limitations that do not necessarily
> > > align with the goals of the community. Some issues relevant to ML on
> > > Flink are:
> > >
> > >    - FLINK-2396 - Review the datasets of dynamic path and static path in
> > >    iteration.
> > >    https://issues.apache.org/jira/browse/FLINK-2396
> > >    This has to do with the ability to iterate over one dataset (the
> > >    model) while changing another (the dataset), which is necessary for
> > >    many ML algorithms like SGD.
> > >    - FLINK-1730 - Add a FlinkTools.persist style method to the Data Set.
> > >    https://issues.apache.org/jira/browse/FLINK-1730
> > >    This is again relevant to many algorithms, to create intermediate
> > >    results etc. For example, L-BFGS development has been attempted 2-3
> > >    times, but always abandoned because the need to collect a DataSet
> > >    kills the performance.
> > >    - FLINK-5782 - Support GPU calculations
> > >    https://issues.apache.org/jira/browse/FLINK-5782
> > >    Many algorithms will benefit greatly from GPU-accelerated linear
> > >    algebra, to the point where a library that doesn't support it is at a
> > >    severe disadvantage compared to other offerings.
> > >
> > >
> > > These issues aside, Stephan has mentioned recently the possibility of
> > > re-structuring the Flink project to allow for more flexibility for the
> > > libraries. I think that sounds quite promising and it should allow the
> > > development to pick up in the libraries, if we can get some more people
> > > reviewing and merging PRs.
> > >
> > > I would be all for updating our vision and roadmap to match what the
> > > community desires from the library.
> > >
> > > [1]
> > >
> > > https://cwiki.apache.org/confluence/display/FLINK/FlinkML%3A+Vision+and+Roadmap
> > >
> > > On Mon, Feb 20, 2017 at 12:47 PM, Gábor Hermann <m...@gaborhermann.com>
> > > wrote:
> > >
> > > > Hi Stavros,
> > > >
> > > > Thanks for bringing this up.
> > > >
> > > > There have been past [1] and recent [2, 3] discussions about the Flink
> > > > libraries, because there are some stalled PRs and overloaded
> > > > committers. (Actually, Till is the only committer shepherd of both the
> > > > CEP and ML libraries, and AFAIK he has a ton of other responsibilities
> > > > and work to do.)
> > > > Thus it's hard to get code reviewed and merged, and without merged code
> > > > it's hard to get committer status, so there are not many committers who
> > > > can review e.g. ML algorithm implementations, and the cycle goes on.
> > > > Until this is resolved somehow, we should help the committers by
> > > > reviewing each other's PRs.
> > > >
> > > > I think prioritizing features (b) is a good way to start. We could
> > > > declare the most blocking features and concentrate on reviewing and
> > > > merging them before moving forward. E.g. the evaluation framework is
> > > > quite important for an ML library in my opinion, and has had a PR
> > > > stalling for a long time [4].
> > > >
> > > > Regarding c), there are general style guides for contributing to
> > > > Flink, so we should follow those. Is there something more ML-specific
> > > > you think we could follow? We should definitely declare that we follow
> > > > scikit-learn, and make sure contributions comply with that.
> > > >
> > > > In terms of features (a, d), I think we should first see the bigger
> > > > picture. That is, it would be nice to discuss a clearer direction for
> > > > Flink ML. I've seen a lot of interest in contributing to Flink ML
> > > > lately. I believe we should rethink our goals, to put the contribution
> > > > efforts into making a usable and useful library. Are we trying to
> > > > implement as many useful algorithms as possible to create a scalable
> > > > ML library? That would seem ambitious, and of course there are a lot
> > > > of frameworks and libraries that already have something like this as a
> > > > goal (e.g. Spark MLlib, Mahout). Should we rather create connectors to
> > > > existing libraries? Then we cannot really do Flink-specific
> > > > optimizations. Should we go for online machine learning (as Flink is
> > > > concentrating on streaming)? We already have a connector to SAMOA. We
> > > > could go on with questions like this. Maybe I'm missing something, but
> > > > I haven't seen such directions declared.
> > > >
> > > > Cheers,
> > > > Gabor
> > > >
> > > > [1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Opening-a-discussion-on-FlinkML-td10265.html
> > > > [2] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Flink-CEP-development-is-stalling-td15237.html#a15341
> > > > [3] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/New-Flink-team-member-Kate-Eri-td15349.html
> > > > [4] https://github.com/apache/flink/pull/1849
> > > >
> > > >
> > > > On 2017-02-20 11:43, Stavros Kontopoulos wrote:
> > > >
> > > >> (Resending with the appropriate topic)
> > > >>
> > > >> Hi,
> > > >>
> > > >> I would like to start a discussion about next steps for Flink ML.
> > > >> Currently there is a lot of work going on but needs a push forward.
> > > >>
> > > >> Some topics to discuss:
> > > >>
> > > >> a) How several features should be planned and get aligned with Flink
> > > >> releases.
> > > >> b) Priorities of what should be done.
> > > >> c) Basic guidelines for code: style guides, scikit-learn compliance,
> > > >> etc.
> > > >> d) Missing features important for the success of the library, next
> > > >> steps, etc.
> > > >>
> > > >> Thoughts?
> > > >>
> > > >> Best,
> > > >> Stavros
> > > >>
> > > >>
> > > >
> > >
> >
>
