To extend the functionality of Flink a separate branch of development was
dedicated for low latency, distributed stream processing support. The
development started during March of 2014 and is approaching a state where
it might be considered a candidate for becoming part of the main repository.

As of today a WordCount
<https://github.com/stratosphere/stratosphere-streaming/blob/master/src/main/java/eu/stratosphere/streaming/examples/wordcount/WordCountLocal.java#L30-41>
example streaming program would fairly similar to the one that the batch
API provides:

StreamExecutionEnvironment env = new StreamExecutionEnvironment();

DataStream<Tuple2<String, Integer>> dataStream =

  env.readTextFile("src/test/resources/testdata/hamlet.txt")
                                                         .flatMap(new
WordCountSplitter())
                                                         .partitionBy(0)
                                                         .map(new
WordCountCounter());

dataStream.print();

env.execute();

The user defined functions are extending the same classes as in the batch
case (e.g. a FlatMapFunction for a flatmap, see WordCountSplitter
<https://github.com/stratosphere/stratosphere-streaming/blob/master/src/main/java/eu/stratosphere/streaming/examples/wordcount/WordCountSplitter.java>)
thus providing code interusability between the two approaches.

As for performance the 0.1 version
<https://github.com/stratosphere/stratosphere-streaming/tree/release-0.1>
released in the beginning of June was slightly better on a single core then
Apache Storm, one of the major players of the field. Cluster performance
needs further optimization. This version provided a lower level API, fairly
similar to the one Storm has. For a deeper dive on this state of the
development and the challenges faced please refer to the slides
<http://info.ilab.sztaki.hu/~mbalassi/dw_forum_2014/dw_forum_streaming.pdf>
of a talk form the early days of June.

The 0.2 release is coming soon with the the above demonstrated new API and
improved single core performance. To complete the release the cluster
performance is being measured, and the code is being decomposed into three
subprojects separating core, example and addon functionality.

As for the future fault tolerance is an unresolved issue and as a part of
the Google Summer of Code project an intern is working on iterative stream
processing.

The project is mainly developed at Budapest by three members employed by
Hungarian Academy of Sciences and Eötvös Loránd University and Frank Wu,
our Google Summer of Code student from Singapore. This summer the Hungarian
Academy of Sciences also dedicated 4 interns to the project.

The proposed 0.2 release is still dependant on the 0.5 release of
Stratosphere, however on branch snapshot-0.6
<https://github.com/stratosphere/stratosphere-streaming/tree/snapshot-0.6> the
dependencies are updated to 0.6-snapshot, thus the codebase is ready for
becoming part of the main project - preferably a part of addons until it
becomes stable.

Looking forward to your suggestions.

Cheers,

Márton, Gyula & Gábor

Reply via email to