On October 11, 2016 at 3:31:41 PM, Matt Franklin (m.ben.frank...@gmail.com)
On Tue, Sep 27, 2016 at 6:05 PM sblackmon <sblack...@apache.org> wrote:
> Joey brought this up over the weekend and I think a discussion is overdue
> on the topic.
> Streams components were meant to be compatible with other runtime
> frameworks all along, and for the most part are implemented in a manner
> compatible with distributed execution where coordination, message passing,
> and lifecycle and handled outside of streams libraries. By community
> standards any component or component configuration object that doesn't
> cleanly serializable for relocation in a distributed framework is a bug.
Agreed, though this could be more explicit.
Some modules contain a unit test that checks for serializability of components.
Maybe we can find a way to systematize this such that every Provider,
Processor, and Persister added to the code base gets run through a
serializability check during mvn test. We could try to catch up by adding
similar tests throughout the code base and -1 new submissions that don’t
include such a test, but that approach seems harder than doing something using
> When the streams project got started in 2012 storm was the only TLP
> real-time data processing framework at apache, but now there are plenty of
> good choices all of which are faster and better tested than our
> streams-runtime-local module.
> So, what should be the role of streams-runtime-local? Should we keep it
> at all? The tests take forever to run and my organization has stopped
> using it entirely. The best argument for keeping it is that it is useful
> when integration testing small pipelines, but perhaps we could just agree
> to use something else for that purpose?
I think having a local runtime for testing or small streams is valuable,
but there is a ton of work that needs to go into the current runtime.
Yeah, the magnitude of that effort is why it might be worth considering
starting from scratch. We need a quality testing harness runtime at a minimum.
local is suitable for that, barely.
> Do we want to keep the other runtime modules around and continue adding
> more? I’ve found that when embedding streams components in other
> frameworks (spark and flink most recently) I end up creating a handful of
> classes to help bind streams interfaces and instances within the pdfs /
> functions / transforms / whatever are that framework atomic unit of
> computation and reusing them in all my pipelines.
I think this is valuable. A set of libraries that adapt a common
programming model to various frameworks that simply stream development is
inherently cool. Write once, run anywhere.
It’s a cool idea, but I’ve never successfully used it that way. Also as soon
as you bring a batch framework like pig, spark, or flink into your design,
streams persisters quickly become irrelevant because performance is usually
better using the framework preferred libraries. Streaming frameworks not as
much but there’s a trade-off to consider with every integration point and I’ve
found pretty much universally that the framework-maintained libraries tend to
> How about the StreamBuilder interface? Does anyone still believe we
> should support (and still want to work on) classes
> implementing StreamBuilder to build and running a pipeline comprised solely
> of streams components on other frameworks? Personally I prefer to write
> code using the framework APIs at the pipeline level, and embed individual
> streams components at the step level.
I think this could be valuable if done better. For instance, binding
classes to steps in the stream pipeline, rather than instances. This would
let the aforementioned adapter libraries configure components using the
programming model declared by streams and setup pipelines in target
It’s a cool idea, but i think down that path we’d wind up with pretty beefy
runtimes, loss of clean separation between modules, and unwanted transitive
dependencies. For example, an hdfs persist reader embedded in a spark pipeline
should be interpreted as sc.readTextFile / readSequenceFile, or else spark.*
properties that determine read behavior won’t be picked up. An elastic search
persist writer embedded in a flink pipeline should be interpreted to use
org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSink. Or just
depending on what’s available on the class path. Pretty soon each runtime
becomes a crazy hard to test monolith because it has to be to properly optimize
how it interprets each component. That’s my fear anyway and it’s why I’m
leaning more toward runtimes that don’t have a StreamBuilder at all, and just
provide configuration support and static helper methods.
> Any other thoughts on the topic?
> - What should the focus be? If you look at the code, the project really
> provides 3 things: (1) a stream processing engine and integration with data
> persistence mechanisms, (2) a reference implementation of ActivityStreams,
> AS schemas, and tools for interlinking activity objects and events, and (3)
> a uniform API for integrating with social network APIs. I don't think that
> first thing is needed anymore. Just looking at Apache projects, NiFi, Apex
> + Apex Malhar, and to some extent Flume are further along here. Stream Sets
> covers some of this too, and arguably Logstash also gets used for this sort
> of work. I.e., I think the project would be much stronger if it focused on
> (2) and (3) and marrying those up to other Apache projects that fit (1).
> Minimally, it needs to be de-entangled a bit.