Flink has a local environment (based on akka) that I’ve found to work very well
for integration testing of flink data pipelines, which I think could also be
suitable for testing and running streams component pipelines.
A few nice things about Flink’s architecture vis-a-vis Apache Streams:
* Sub-second spin-up and tear-down of local environment
* Near-real-time propagation of data between stages (not micro-batch)
* Exactly-once data guarantees from many sources, including kafka and hdfs, via
* Ability to create a snapshot (checkpoint) on-demand and restart the pipeline,
preserving state of
* Most data in-flight is kept in binary blocks that spill to disk, so no out of
heap crashes even when memory is constrained.
On October 12, 2016 at 1:55:18 PM, Ryan Ebanks (ryaneba...@gmail.com) wrote:
It's been a long time since I contributed, but I would like to agree that
the local runtime should be trashed/re-written. Preferably in an akka
implementation that is lock free or as lock free as possible. The current
one is not very good, and I say that knowing I wrote a lot of it. I also
think it should be designed for local testing only and not for production
use. There are too many other frameworks to use to justify the work needed
to fix/re-write it to a production standard.
On Wed, Oct 12, 2016 at 11:21 AM, sblackmon <sblack...@apache.org> wrote:
> On October 11, 2016 at 3:31:41 PM, Matt Franklin (m.ben.frank...@gmail.com)
> On Tue, Sep 27, 2016 at 6:05 PM sblackmon <sblack...@apache.org> wrote:
> > All,
> > Joey brought this up over the weekend and I think a discussion is overdue
> > on the topic.
> > Streams components were meant to be compatible with other runtime
> > frameworks all along, and for the most part are implemented in a manner
> > compatible with distributed execution where coordination, message
> > and lifecycle are handled outside of streams libraries. By community
> > standards any component or component configuration object that isn't
> > cleanly serializable for relocation in a distributed framework is a bug.
> Agreed, though this could be more explicit.
> Some modules contain a unit test that checks for serializability of
> components. Maybe we can find a way to systematize this such that every
> Provider, Processor, and Persister added to the code base gets run through
> a serializability check during mvn test. We could try to catch up by
> adding similar tests throughout the code base and -1 new submissions that
> don’t include such a test, but that approach seems harder than doing
> something using Reflections.
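The serializability check discussed above could be sketched roughly as follows. This is only an illustration, not code from the streams code base: `SerializabilityCheck`, `GoodProcessor`, and `BadProcessor` are hypothetical names, and the step that would discover every Provider, Processor, and Persister on the class path (for example with Reflections) is assumed and left out. The sketch shows only the round trip through Java serialization that each discovered class would be run through.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class SerializabilityCheck {

    /** Round-trips the object through Java serialization; true if it survives. */
    static boolean isCleanlySerializable(Object component) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(component);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject() != null;
            }
        } catch (IOException | ClassNotFoundException e) {
            // NotSerializableException is an IOException, so it lands here.
            return false;
        }
    }

    /** Hypothetical component whose fields are all serializable. */
    static class GoodProcessor implements Serializable {
        private static final long serialVersionUID = 1L;
        String prefix = "x";
    }

    /** Hypothetical component holding a non-serializable, non-transient field. */
    static class BadProcessor implements Serializable {
        private static final long serialVersionUID = 1L;
        Object handle = new Object(); // java.lang.Object is not Serializable
    }
}
```

A test run during mvn test could then loop over the discovered classes, instantiate each one, and fail the build for any instance where `isCleanlySerializable` returns false.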
> > When the streams project got started in 2012 storm was the only TLP
> > real-time data processing framework at apache, but now there are plenty
> > of good choices all of which are faster and better tested than our
> > streams-runtime-local module.
> > So, what should be the role of streams-runtime-local? Should we keep it
> > at all? The tests take forever to run and my organization has stopped
> > using it entirely. The best argument for keeping it is that it is useful
> > when integration testing small pipelines, but perhaps we could just agree
> > to use something else for that purpose?
> I think having a local runtime for testing or small streams is valuable,
> but there is a ton of work that needs to go into the current runtime.
> Yeah, the magnitude of that effort is why it might be worth considering
> starting from scratch. We need a quality testing harness runtime at a
> minimum. local is suitable for that, barely.
> > Do we want to keep the other runtime modules around and continue adding
> > more? I’ve found that when embedding streams components in other
> > frameworks (spark and flink most recently) I end up creating a handful of
> > classes to help bind streams interfaces and instances within the udfs /
> > functions / transforms / whatever is that framework's atomic unit of
> > computation, and reusing them in all my pipelines.
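The kind of binding class described above might look something like the sketch below. Everything in it is an assumption made for illustration: `SimpleProcessor` is a simplified stand-in for a streams processor interface (it works on plain strings), `FlatMapAdapter` and `SplitWords` are hypothetical, and the JDK's own `java.util.stream` plays the role of the host framework so the example stays self-contained.

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class ProcessorAdapterSketch {

    /** Hypothetical, simplified stand-in for a streams processor interface. */
    interface SimpleProcessor extends Serializable {
        void prepare();
        List<String> process(String entry);
        void cleanUp();
    }

    /** Wraps a processor as a flat-map function, preparing it lazily on first use. */
    static final class FlatMapAdapter
            implements Function<String, Stream<String>>, Serializable {
        private static final long serialVersionUID = 1L;
        private final SimpleProcessor processor;
        private transient boolean prepared; // reset to false after deserialization

        FlatMapAdapter(SimpleProcessor processor) {
            this.processor = processor;
        }

        @Override
        public Stream<String> apply(String entry) {
            if (!prepared) {
                processor.prepare();
                prepared = true;
            }
            return processor.process(entry).stream();
        }
    }

    /** Hypothetical processor that splits each entry on whitespace. */
    static final class SplitWords implements SimpleProcessor {
        private static final long serialVersionUID = 1L;
        @Override public void prepare() { }
        @Override public List<String> process(String entry) {
            return Arrays.asList(entry.trim().split("\\s+"));
        }
        @Override public void cleanUp() { }
    }

    /** Runs the adapter inside a JDK stream, standing in for a framework pipeline. */
    static List<String> run(List<String> input) {
        return input.stream()
                .flatMap(new FlatMapAdapter(new SplitWords()))
                .collect(Collectors.toList());
    }
}
```

The sketch never calls `cleanUp()`. A real adapter would need to hook that into whatever close or teardown callback the host framework provides.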
> I think this is valuable. A set of libraries that adapt a common
> programming model to various frameworks that simplify stream development is
> inherently cool. Write once, run anywhere.
> It’s a cool idea, but I’ve never successfully used it that way. Also as
> soon as you bring a batch framework like pig, spark, or flink into your
> design, streams persisters quickly become irrelevant because performance is
> usually better using the framework's preferred libraries. Streaming
> frameworks not as much, but there’s a trade-off to consider with every
> integration point and I’ve found pretty much universally that the
> framework-maintained libraries tend to be faster.
> > How about the StreamBuilder interface? Does anyone still believe we
> > should support (and still want to work on) classes
> > implementing StreamBuilder to build and run a pipeline comprised
> > of streams components on other frameworks? Personally I prefer to write
> > code using the framework APIs at the pipeline level, and embed individual
> > streams components at the step level.
> I think this could be valuable if done better. For instance, binding
> classes to steps in the stream pipeline, rather than instances. This would
> let the aforementioned adapter libraries configure components using the
> programming model declared by streams and setup pipelines in target
> It’s a cool idea, but I think down that path we’d wind up with pretty
> beefy runtimes, loss of clean separation between modules, and unwanted
> transitive dependencies. For example, an hdfs persist reader embedded in a
> spark pipeline should be interpreted as sc.readTextFile / readSequenceFile,
> or else spark.* properties that determine read behavior won’t be picked
> up. An elastic search persist writer embedded in a flink pipeline should
> be interpreted to use
> Or just maybe
> on what’s available on the class path. Pretty soon each runtime becomes a
> crazy hard to test monolith because it has to be to properly optimize how
> it interprets each component. That’s my fear anyway and it’s why I’m
> leaning more toward runtimes that don’t have a StreamBuilder at all, and
> just provide configuration support and static helper methods.
> > Any other thoughts on the topic?
> > Steve
> > - What should the focus be? If you look at the code, the project really
> > provides 3 things: (1) a stream processing engine and integration with
> > persistence mechanisms, (2) a reference implementation of
> > AS schemas, and tools for interlinking activity objects and events, and
> > (3) a uniform API for integrating with social network APIs. I don't think
> > the first thing is needed anymore. Just looking at Apache projects, NiFi,
> > + Apex Malhar, and to some extent Flume are further along here. Stream
> > covers some of this too, and arguably Logstash also gets used for this kind
> > of work. I.e., I think the project would be much stronger if it focused on
> > (2) and (3) and marrying those up to other Apache projects that fit (1).
> > Minimally, it needs to be de-entangled a bit.