[I'll be switching to [email protected] soon.]

Hi Maximilian,

I'd be all for including flink-dataflow as flink-beam as soon as possible.
The current GitHub version of the runner depends on some 'internal' classes of Dataflow 1.0 which have either moved, changed, or disappeared in our (still internal) Beam preview. If you like, I could put together a CL to fix as much of that as I can and hand it over to you, so as to make the first drop as easy as possible. However, if it's a bit weird for me to be front-loading work that depends on code no one else can see quite yet, then I'm happy to defer that work. (Forgive me while I get used to this new way of working.)

As you say, many features of our respective models match up beautifully. The current flink-dataflow runner takes the sensible approach of translating Dataflow pipelines at the lowest possible level. For example, it uses windowed values, defers to Dataflow's triggering and group-also-by-windows machinery, and so on. In the longer term I'd like to see whether we can translate more Beam features to their Flink equivalents. For example, if a Window.into only uses end-of-window watermark triggers, it could be translated to use the corresponding Flink trigger. That would not only improve execution speed, but would also be a forcing function to keep the respective model semantics uniform. There are some very interesting design challenges for us there.

Best,
-m

On Fri, Feb 12, 2016 at 11:06 AM, Maximilian Michels <[email protected]> wrote:
> Hi Beamers,
>
> Now that things are getting started and we discuss the technical
> vision of Beam, we would like to contribute the Flink runner and start
> by sharing some details about the status and the roadmap.
>
> The Flink Runner integrates deeply and naturally with the Dataflow SDK
> (the Beam precursor), because the Flink DataStream API shares many
> concepts with the Dataflow model.
> Based on whether the program input is bounded or unbounded, the
> program goes against Flink's DataStream or DataSet API.
>
> A quick preview at some of the nice features of the runner:
>
> - Support for stream transformations, event time, watermarks
> - The full Dataflow windowing semantics, including fixed/sliding
>   time windows, and session windows
> - Integration with Flink's streaming sources (Kafka, RabbitMQ, ...)
>
> - Batch (bounded sources) integrates fully with Flink's managed
>   memory techniques and out-of-core algorithms, supporting huge joins
>   and aggregations.
> - Integration with Flink's batch API sources (plain text, CSV, Avro,
>   JDBC, HBase, ...)
>
> - Integration with Flink's fault tolerance - both batch and
>   streaming programs recover from failures
> - After upgrading the dependency to Flink 1.0, one could even use
>   the Flink Savepoints feature (save streaming state for later resuming)
>   with the Dataflow programs.
>
> Attached you can find the document we drew up with more information
> about the current state of the Runner and the roadmap for its upcoming
> features:
>
> https://docs.google.com/document/d/1QM_X70VvxWksAQ5C114MoAKb1d9Vzl2dLxEZM4WYogo/edit?usp=sharing
>
> The Runner executes the quasi-complete Beam streaming model (well,
> Dataflow, actually, because Beam is not there yet).
>
> Given the current excitement and buzz around Beam, we could add this
> runner to the Beam repository and link it as a "preview" for the
> people that want to get a feeling of what it will be like to write and
> run streaming (unbounded) Beam programs. That would give people
> something tangible until the actual Beam code is available.
>
> What do you think?
>
> Best,
> Max
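P.S. To make the Window.into idea above a bit more concrete, here is a minimal sketch of the kind of check such a translation pass could make. All names below are illustrative placeholders, not the real Beam or Flink APIs: the point is only that a windowing strategy maps onto Flink's built-in event-time trigger exactly when it fires once, at the end-of-window watermark, with no early or late firings.

```java
// Hypothetical sketch -- enum and method names are illustrative, not real
// Dataflow/Beam or Flink classes. A runner's translator could use a check
// like this to decide whether to lower a pipeline's triggering to Flink's
// native event-time trigger instead of the generic group-also-by-windows path.
public class TriggerTranslation {

    // Coarse classification of the trigger a Window.into transform declared.
    enum TriggerKind { AFTER_WATERMARK_END_OF_WINDOW, AFTER_PROCESSING_TIME, OTHER }

    // Translatable only when the strategy fires exactly once, at the
    // end-of-window watermark, with no early/late firings bolted on.
    static boolean translatableToFlinkEventTimeTrigger(TriggerKind kind,
                                                       boolean hasEarlyOrLateFirings) {
        return kind == TriggerKind.AFTER_WATERMARK_END_OF_WINDOW && !hasEarlyOrLateFirings;
    }

    public static void main(String[] args) {
        // Plain end-of-window watermark trigger: can use Flink's trigger directly.
        System.out.println(translatableToFlinkEventTimeTrigger(
                TriggerKind.AFTER_WATERMARK_END_OF_WINDOW, false)); // prints true
        // Processing-time trigger: fall back to the generic triggering machinery.
        System.out.println(translatableToFlinkEventTimeTrigger(
                TriggerKind.AFTER_PROCESSING_TIME, false)); // prints false
    }
}
```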
