I'm not sure this is feasible; trying to do everything at the same time could mean getting nothing done :( I'm just afraid that the words "we will work on streaming, not on batching, and we have no committer time for this" mean that yes, we started work on FLINK-1730, but nobody will commit this work in the end, as already happened with that ticket.
On 23 Feb 2017 at 14:26, "Gábor Hermann" <m...@gaborhermann.com> wrote:

> @Theodore: Great to hear you think the "batch on streaming" approach is
> possible! Of course, we need to pay attention to all the pitfalls there
> if we go that way.
>
> +1 for a design doc!
>
> I would add that it's possible to make efforts in all three directions
> (i.e. batch, online, batch on streaming) at the same time, although it
> might be worth concentrating on one. E.g. it would not be so useful to
> have the same batch algorithms in both the batch API and the streaming
> API. We can decide later.
>
> The design doc could be partitioned into these three directions, and we
> can collect the pros and cons there too. What do you think?
>
> Cheers,
> Gabor
>
>
> On 2017-02-23 12:13, Theodore Vasiloudis wrote:
>
>> Hello all,
>>
>> @Gabor, we have discussed the idea of using the streaming API to write
>> all of our ML algorithms with a couple of people offline, and I think
>> it might be possible and is generally worth a shot. The approach we
>> would take would be close to Vowpal Wabbit: not exactly "online", but
>> rather "fast-batch".
>>
>> There will be problems popping up again, even for very simple
>> algorithms like online linear regression with SGD [1], but hopefully
>> fixing those will be more aligned with the priorities of the community.
>>
>> @Katherin, my understanding is that, given the limited resources,
>> there is no development effort focused on batch processing right now.
>>
>> So to summarize, it seems like there are people willing to work on ML
>> on Flink, but nobody is sure how to do it. There are many directions
>> we could take (batch, online, batch on streaming), each with its own
>> merits and downsides.
>>
>> If you want, we can start a design doc, move the conversation there,
>> come up with a roadmap, and start implementing.
>>
>> Regards,
>> Theodore
>>
>> [1]
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Understanding-connected-streams-use-without-timestamps-td10241.html
>>
>> On Tue, Feb 21, 2017 at 11:17 PM, Gábor Hermann <m...@gaborhermann.com>
>> wrote:
>>
>>> It's great to see so much activity in this discussion :)
>>> I'll try to add my thoughts.
>>>
>>> I think building a developer community (Till's point 2) can be
>>> somewhat separated from what features we should aim for (point 1)
>>> and showcasing (point 3). Thanks Till for bringing up the ideas for
>>> restructuring; I'm sure we'll find a way to make the development
>>> process more dynamic. I'll try to address the rest here.
>>>
>>> It's hard to choose directions between streaming and batch ML. As
>>> Theo has indicated, not much online ML is used in production, but
>>> Flink concentrates on streaming, so online ML would be a better fit
>>> for Flink. However, as most of you argued, there's a definite need
>>> for batch ML. But batch ML seems hard to achieve because there are
>>> blocking issues with persisting, iteration paths etc. So it's no
>>> good either way.
>>>
>>> I propose a seemingly crazy solution: what if we developed batch
>>> algorithms with the streaming API as well? The batch API would
>>> clearly seem more suitable for ML algorithms, but there are a lot of
>>> benefits to this approach too, so it's clearly worth considering.
>>> Flink also has the high-level vision of "streaming for everything",
>>> which would clearly fit this case. What do you all think about this?
>>> Do you think this solution would be feasible?
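For concreteness, the connected-streams online SGD that Theodore's [1]
refers to might look roughly like the sketch below, using Flink's Scala
DataStream API. This is a minimal illustration under stated assumptions,
not existing FlinkML code: the LabeledPoint type, the learning rate, and
the un-checkpointed per-instance model are all made up for the example.

    import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
    import org.apache.flink.util.Collector

    case class LabeledPoint(label: Double, features: Array[Double])

    // One operator, two inputs: labeled examples update a linear model
    // with an SGD step; unlabeled feature vectors are scored against
    // the current model. Training and low-latency serving in one job.
    class OnlineSGD(learningRate: Double, dim: Int)
        extends CoFlatMapFunction[LabeledPoint, Array[Double], Double] {

      // Per-instance model; a real job would checkpoint or key this.
      private var weights = Array.fill(dim)(0.0)

      private def dot(x: Array[Double]): Double =
        weights.zip(x).map { case (w, v) => w * v }.sum

      // Training input: one SGD step per labeled example.
      override def flatMap1(p: LabeledPoint, out: Collector[Double]): Unit = {
        val err = dot(p.features) - p.label // squared-loss gradient
        weights = weights.zip(p.features)
          .map { case (w, v) => w - learningRate * err * v }
      }

      // Prediction input: score with the current model.
      override def flatMap2(x: Array[Double], out: Collector[Double]): Unit =
        out.collect(dot(x))
    }

Wiring it up would be something like
train.connect(predict).flatMap(new OnlineSGD(0.01, dim)), and the
"which input arrives first" timing questions discussed in [1] show up
exactly at that point.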
>>> I would be happy to make a more elaborate proposal, but let me
>>> outline my main ideas here:
>>>
>>> 1) Simplifying by using one system
>>> It could simplify the work of both users and developers. One could
>>> execute training once, or periodically, e.g. by using windows.
>>> Low-latency serving and training could be done in the same system.
>>> We could implement incremental algorithms that combine online
>>> learning (or predictions) with batch learning, without any side
>>> inputs. Of course, all the logic describing these must be somehow
>>> implemented (e.g. synchronizing predictions with training), but it
>>> should be easier to do so in one system than by combining e.g. the
>>> batch and streaming APIs.
>>>
>>> 2) Batch ML with the streaming API is not harder
>>> Despite these benefits, it could seem harder to implement batch ML
>>> with the streaming API, but in my opinion it's not. There are more
>>> flexible, lower-level optimization opportunities with the streaming
>>> API. Most distributed ML algorithms use a lower-level model than the
>>> batch API anyway, so sometimes it feels like forcing the algorithm
>>> logic into the batch API and tweaking it. Although we could not use
>>> batch primitives like join, we would have the flexibility of a
>>> lower-level API instead. E.g. in my experience with implementing a
>>> distributed matrix factorization algorithm [1], I couldn't do a
>>> simple optimization because of the limitations of the iteration API
>>> [2]. Even if we pushed all the development effort into making the
>>> batch API more suitable for ML, there would be things we couldn't
>>> do. E.g. there are approaches for updating a model iteratively
>>> without locks [3,4] (i.e. somewhat asynchronously), and I don't see
>>> a clear way to implement such algorithms with the batch API.
>>>
>>> 3) The streaming community (users and devs) benefits
>>> The Flink streaming community in general would also benefit from
>>> this direction. There are many features needed in the streaming API
>>> for ML to work, but this is also true for the batch API. One really
>>> important one is the loops API (a.k.a. iterative DataStreams) [5].
>>> There has been a lot of effort (mostly from Paris) to make it mature
>>> enough [6]. Kate mentioned using GPUs, and I'm sure they have uses
>>> in streaming generally [7]. Thus, by improving the streaming API to
>>> support ML algorithms, the streaming API benefits too (which is
>>> important, as it has a lot more production users than the batch
>>> API).
>>>
>>> 4) Performance can be at least as good
>>> I believe the same performance could be achieved with the streaming
>>> API as with the batch API. The streaming API is much closer to the
>>> runtime than the batch API. For corner cases covered by
>>> runtime-layer optimizations of the batch API, we could find a way to
>>> do the same (or a similar) optimization for the streaming API (see
>>> my previous point). One such case could be using managed memory (and
>>> spilling to disk). There are also benefits by default, e.g. we would
>>> have finer-grained fault tolerance with the streaming API.
>>>
>>> 5) We could keep the batch ML API
>>> For the shorter term, we should not throw away all the algorithms
>>> implemented with the batch API. By pushing forward the development
>>> of side inputs, we could make them usable with the streaming API.
>>> Then, if the library gains some popularity, we could replace the
>>> algorithms in the batch API with streaming ones, to avoid the
>>> performance costs of e.g. not being able to persist.
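Returning to point 1's "execute training periodically, e.g. by using
windows" (and Theodore's "fast-batch" framing): this can be sketched
with a count window that turns per-example gradients into one averaged
model update per mini-batch. Everything below (the Gradient type, the
batch size of 2, the toy data) is an illustrative assumption, not an
agreed design.

    import org.apache.flink.streaming.api.scala._

    object MiniBatchSketch {
      // A per-example gradient plus a count, so batches can be averaged.
      case class Gradient(sum: Array[Double], count: Long)

      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        // Stand-in for gradients computed against the current model.
        val gradients: DataStream[Gradient] = env.fromElements(
          Gradient(Array(0.1, -0.2), 1L), Gradient(Array(0.3, 0.0), 1L),
          Gradient(Array(-0.1, 0.4), 1L), Gradient(Array(0.2, 0.2), 1L))

        // "Fast-batch": sum gradients inside a fixed-size count window,
        // then normalize, yielding one model update per mini-batch.
        val updates: DataStream[Array[Double]] = gradients
          .countWindowAll(2)
          .reduce { (a, b) =>
            Gradient(a.sum.zip(b.sum).map { case (x, y) => x + y },
              a.count + b.count)
          }
          .map(g => g.sum.map(_ / g.count))

        updates.print()
        env.execute("mini-batch gradient sketch")
      }
    }

Note that countWindowAll is non-parallel; sharding the work with keyed
windows, or feeding the updates back via the loops API [5], would be
the natural next step.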
>>> 6) General tools for implementing ML algorithms
>>> Besides implementing algorithms one by one, we could provide more
>>> general tools that make it easier to implement algorithms, e.g. a
>>> parameter server [8,9]. Theo also mentioned in another thread that
>>> TensorFlow has a similar model to Flink streaming; we could look
>>> into that too. I think that when deploying a production ML system,
>>> much more configuration and tweaking is often needed than e.g. Spark
>>> MLlib allows. Why not allow it?
>>>
>>> 7) Showcasing
>>> Showcasing this could be easier. We could say that we're doing batch
>>> ML with a streaming API. That's interesting in its own right. IMHO
>>> this integration is also a more approachable way towards end-to-end
>>> ML.
>>>
>>> Thanks for reading so far :)
>>>
>>> [1] https://github.com/apache/flink/pull/2819
>>> [2] https://issues.apache.org/jira/browse/FLINK-2396
>>> [3] https://people.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf
>>> [4] https://www.usenix.org/system/files/conference/hotos13/hotos13-final77.pdf
>>> [5] https://cwiki.apache.org/confluence/display/FLINK/FLIP-15+Scoped+Loops+and+Job+Termination
>>> [6] https://github.com/apache/flink/pull/1668
>>> [7] http://lsds.doc.ic.ac.uk/sites/default/files/saber-sigmod16.pdf
>>> [8] https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf
>>> [9] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Using-QueryableState-inside-Flink-jobs-and-Parameter-Server-implementation-td15880.html
>>>
>>> Cheers,
>>> Gabor
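The parameter server of point 6 [8,9] could likewise start out as keyed
connected streams: "push" messages apply a delta to a parameter shard,
"pull" messages read it back. The message types and the in-memory map
below are illustrative assumptions only; a production version would
keep the shard in Flink's keyed state so it is checkpointed (and
potentially queryable, cf. the thread in [9]).

    import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
    import org.apache.flink.util.Collector

    import scala.collection.mutable

    case class Push(key: Int, delta: Double) // worker pushes a gradient
    case class Pull(key: Int)                // worker asks for a value

    // One parameter shard per parallel instance; keyBy routes each
    // parameter key to exactly one instance.
    class ParamShard extends CoFlatMapFunction[Push, Pull, (Int, Double)] {
      // In-memory shard; a real implementation would use keyed state
      // so the parameters survive failures.
      private val params = mutable.HashMap.empty[Int, Double]

      // Push input: apply the delta to the addressed parameter.
      override def flatMap1(p: Push, out: Collector[(Int, Double)]): Unit =
        params(p.key) = params.getOrElse(p.key, 0.0) + p.delta

      // Pull input: emit the current value of the addressed parameter.
      override def flatMap2(p: Pull, out: Collector[(Int, Double)]): Unit =
        out.collect((p.key, params.getOrElse(p.key, 0.0)))
    }

    // Wiring (hypothetical streams):
    //   pushes.connect(pulls).keyBy(_.key, _.key).flatMap(new ParamShard)

Because keyBy routes each parameter key to exactly one parallel
instance, updates are applied without locks, in the spirit of the
asynchronous approaches in [3,4].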