Preferably in a akka implementation that is lock free or as lock free as possible.
Flink has a local environment (based on akka) that I’ve found to work very well for integration testing of flink data pipelines, which I think could also be suitable for testing and running streams component pipelines. A few nice things about Flink’s architecture vis-a-vis Apache Streams: * Sub-second spin-up and tear-down of local environment * Near-real-time propagation of data between stages (not micro-batch) * Exactly-once data guarantees from many sources, including kafka and hdfs, via checkpointing * Ability to create a snapshot (checkpoint) on-demand and restart the pipeline, preserving state of * Most data in-flight is kept in binary blocks that spill to disk, so no out of heap crashes even when memory is constrained. On October 12, 2016 at 1:55:18 PM, Ryan Ebanks (ryaneba...@gmail.com) wrote: Its been a long time since I contributed, but I would like to agree that local runtime should be trashed/re-written. Preferably in a akka implementation that is lock free or as lock free as possible. The current one is not very good, and I say that knowing I wrote a lot of it. I also think it should be designed for local testing only and not for production use. There are too many other frameworks to use to justify the work needed to fix/re-write it to a production standard. On Wed, Oct 12, 2016 at 11:21 AM, sblackmon <sblack...@apache.org> wrote: > On October 11, 2016 at 3:31:41 PM, Matt Franklin (m.ben.frank...@gmail.com) > wrote: > On Tue, Sep 27, 2016 at 6:05 PM sblackmon <sblack...@apache.org> wrote: > > > All, > > > > > > > > Joey brought this up over the weekend and I think a discussion is overdue > > on the topic. > > > > > > > > Streams components were meant to be compatible with other runtime > > frameworks all along, and for the most part are implemented in a manner > > compatible with distributed execution where coordination, message > passing, > > and lifecycle and handled outside of streams libraries. By community > > standards any component or component configuration object that doesn't > > cleanly serializable for relocation in a distributed framework is a bug. > > > > Agreed, though this could be more explicit. > > Some modules contain a unit test that checks for serializability of > components. Maybe we can find a way to systematize this such that every > Provider, Processor, and Persister added to the code base gets run through > a serializability check during mvn test. We could try to catch up by > adding similar tests throughout the code base and -1 new submissions that > don’t include such a test, but that approach seems harder than doing > something using Reflections. > > > > > > > > > > When the streams project got started in 2012 storm was the only TLP > > real-time data processing framework at apache, but now there are plenty > of > > good choices all of which are faster and better tested than our > > streams-runtime-local module. > > > > > > > > > So, what should be the role of streams-runtime-local? Should we keep it > > at all? The tests take forever to run and my organization has stopped > > using it entirely. The best argument for keeping it is that it is useful > > when integration testing small pipelines, but perhaps we could just agree > > to use something else for that purpose? > > > > > I think having a local runtime for testing or small streams is valuable, > but there is a ton of work that needs to go into the current runtime. > > Yeah, the magnitude of that effort is why it might be worth considering > starting from scratch. We need a quality testing harness runtime at a > minimum. local is suitable for that, barely. > > > > > > > > Do we want to keep the other runtime modules around and continue adding > > more? I’ve found that when embedding streams components in other > > frameworks (spark and flink most recently) I end up creating a handful of > > classes to help bind streams interfaces and instances within the pdfs / > > functions / transforms / whatever are that framework atomic unit of > > computation and reusing them in all my pipelines. > > > > > I think this is valuable. A set of libraries that adapt a common > programming model to various frameworks that simply stream development is > inherently cool. Write once, run anywhere. > > It’s a cool idea, but I’ve never successfully used it that way. Also as > soon as you bring a batch framework like pig, spark, or flink into your > design, streams persisters quickly become irrelevant because performance is > usually better using the framework preferred libraries. Streaming > frameworks not as much but there’s a trade-off to consider with every > integration point and I’ve found pretty much universally that the > framework-maintained libraries tend to be faster. > > > > > > > > How about the StreamBuilder interface? Does anyone still believe we > > should support (and still want to work on) classes > > implementing StreamBuilder to build and running a pipeline comprised > solely > > of streams components on other frameworks? Personally I prefer to write > > code using the framework APIs at the pipeline level, and embed individual > > streams components at the step level. > > > > > I think this could be valuable if done better. For instance, binding > classes to steps in the stream pipeline, rather than instances. This would > let the aforementioned adapter libraries configure components using the > programming model declared by streams and setup pipelines in target > systems. > > It’s a cool idea, but i think down that path we’d wind up with pretty > beefy runtimes, loss of clean separation between modules, and unwanted > transitive dependencies. For example, an hdfs persist reader embedded in a > spark pipeline should be interpreted as sc.readTextFile / readSequenceFile, > or else spark.* properties that determine read behavior won’t be picked > up. An elastic search persist writer embedded in a flink pipeline should > be interpreted to use > org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSink. > Or just maybe > org.apache.flink.streaming.connectors.elasticsearch2.ElasticsearchSink > depending > on what’s available on the class path. Pretty soon each runtime becomes a > crazy hard to test monolith because it has to be to properly optimize how > it interprets each component. That’s my fear anyway and it’s why I’m > leaning more toward runtimes that don’t have a StreamBuilder at all, and > just provide configuration support and static helper methods. > > > > > > > > Any other thoughts on the topic? > > > > > > > > Steve > > > > > > > > - What should the focus be? If you look at the code, the project really > > provides 3 things: (1) a stream processing engine and integration with > data > > persistence mechanisms, (2) a reference implementation of > ActivityStreams, > > AS schemas, and tools for interlinking activity objects and events, and > (3) > > a uniform API for integrating with social network APIs. I don't think > that > > first thing is needed anymore. Just looking at Apache projects, NiFi, > Apex > > + Apex Malhar, and to some extent Flume are further along here. Stream > Sets > > covers some of this too, and arguably Logstash also gets used for this > sort > > of work. I.e., I think the project would be much stronger if it focused > on > > (2) and (3) and marrying those up to other Apache projects that fit (1). > > Minimally, it needs to be de-entangled a bit. >