On Wed, Oct 12, 2016 at 2:55 PM Ryan Ebanks <ryaneba...@gmail.com> wrote:

> It's been a long time since I contributed, but I would like to agree that
> the local runtime should be trashed/re-written, preferably in an Akka
> implementation that is lock-free, or as lock-free as possible.  The current
> one is not very good, and I say that knowing I wrote a lot of it.  I also
> think it should be designed for local testing only and not for production
> use.  There are too many other frameworks to use to justify the work needed
> to fix/re-write it to a production standard.
>

Agreed this wouldn't be for production use cases.  The biggest issue with
making it reactive is the forced polling of providers.  If we change the
core programming model, it would make a lot of things much simpler.
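
For context on the polling issue: today a runtime has to sit in a loop
calling the provider's readCurrent() and draining whatever came back.  A
push-based contract along these lines would invert that (purely a sketch;
nothing like ReactiveStreamsProvider exists in streams-core today):

  import org.apache.streams.core.StreamsDatum;

  // hypothetical push-based provider contract: the provider emits datums
  // to a listener as they arrive, instead of the runtime polling
  // readCurrent() on an interval
  public interface ReactiveStreamsProvider {

    interface DatumListener {
      void onDatum(StreamsDatum datum);
    }

    void start(DatumListener listener);

    void stop();
  }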


>
> On Wed, Oct 12, 2016 at 11:21 AM, sblackmon <sblack...@apache.org> wrote:
>
> On October 11, 2016 at 3:31:41 PM, Matt Franklin (m.ben.frank...@gmail.com)
> wrote:
>
> On Tue, Sep 27, 2016 at 6:05 PM sblackmon <sblack...@apache.org> wrote:
>
>
> > All,
> >
> > Joey brought this up over the weekend and I think a discussion is
> > overdue on the topic.
> >
> > Streams components were meant to be compatible with other runtime
> > frameworks all along, and for the most part are implemented in a manner
> > compatible with distributed execution, where coordination, message
> > passing, and lifecycle are handled outside of streams libraries. By
> > community standards, any component or component configuration object
> > that doesn't serialize cleanly for relocation in a distributed framework
> > is a bug.
>
> Agreed, though this could be more explicit.
>
> Some modules contain a unit test that checks for serializability of
> components.  Maybe we can find a way to systematize this such that every
> Provider, Processor, and Persister added to the code base gets run through
> a serializability check during mvn test.  We could try to catch up by
> adding similar tests throughout the code base and -1 new submissions that
> don’t include such a test, but that approach seems harder than doing
> something using Reflections.
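>
> Something along these lines could work, as a rough sketch.  (Assumes the
> org.reflections library on the test classpath; StreamsProcessor here is
> the interface from streams-core.)
>
>   import java.io.Serializable;
>   import java.lang.reflect.Modifier;
>   import java.util.Set;
>
>   import org.apache.streams.core.StreamsProcessor;
>   import org.junit.Test;
>   import org.reflections.Reflections;
>
>   import static org.junit.Assert.assertTrue;
>
>   public class ComponentSerializabilityTest {
>
>     @Test
>     public void allProcessorsAreSerializable() {
>       // scan the whole code base for concrete processor implementations
>       Reflections reflections = new Reflections("org.apache.streams");
>       Set<Class<? extends StreamsProcessor>> processors =
>           reflections.getSubTypesOf(StreamsProcessor.class);
>       for (Class<?> clazz : processors) {
>         if (Modifier.isAbstract(clazz.getModifiers())) {
>           continue;
>         }
>         // any component that can't relocate to another JVM is a bug
>         assertTrue(clazz.getName() + " is not Serializable",
>             Serializable.class.isAssignableFrom(clazz));
>       }
>     }
>   }
>
> A full round-trip through ObjectOutputStream would be a stronger check,
> but even a static check like this would catch most regressions.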
>
> > When the streams project got started in 2012, Storm was the only TLP
> > real-time data processing framework at Apache, but now there are plenty
> > of good choices, all of which are faster and better tested than our
> > streams-runtime-local module.
> >
> > So, what should be the role of streams-runtime-local? Should we keep it
> > at all? The tests take forever to run and my organization has stopped
> > using it entirely. The best argument for keeping it is that it is useful
> > when integration testing small pipelines, but perhaps we could just
> > agree to use something else for that purpose?
>
> I think having a local runtime for testing or small streams is valuable,
> but there is a ton of work that needs to go into the current runtime.
>
> Yeah, the magnitude of that effort is why it might be worth considering
> starting from scratch.  We need a quality testing harness runtime at a
> minimum, and local is only barely suitable for that.
>
> > Do we want to keep the other runtime modules around and continue adding
> > more? I’ve found that when embedding streams components in other
> > frameworks (Spark and Flink most recently) I end up creating a handful
> > of classes to help bind streams interfaces and instances within the
> > UDFs / functions / transforms / whatever that framework’s atomic unit
> > of computation is, and reusing them in all my pipelines.
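> >
> > For Flink, for example, the wrapper ends up looking roughly like this
> > (a sketch; all it assumes is the prepare / process / cleanUp contract
> > of StreamsProcessor from streams-core):
> >
> >   import org.apache.flink.api.common.functions.RichFlatMapFunction;
> >   import org.apache.flink.configuration.Configuration;
> >   import org.apache.flink.util.Collector;
> >
> >   import org.apache.streams.core.StreamsDatum;
> >   import org.apache.streams.core.StreamsProcessor;
> >
> >   // adapts any serializable StreamsProcessor to a Flink flatMap step
> >   public class StreamsProcessorFlatMap
> >       extends RichFlatMapFunction<StreamsDatum, StreamsDatum> {
> >
> >     private final StreamsProcessor processor;
> >
> >     public StreamsProcessorFlatMap(StreamsProcessor processor) {
> >       this.processor = processor;
> >     }
> >
> >     @Override
> >     public void open(Configuration parameters) {
> >       processor.prepare(null);  // once per task, after deserialization
> >     }
> >
> >     @Override
> >     public void flatMap(StreamsDatum datum, Collector<StreamsDatum> out) {
> >       // a processor may emit zero, one, or many datums per input
> >       for (StreamsDatum result : processor.process(datum)) {
> >         out.collect(result);
> >       }
> >     }
> >
> >     @Override
> >     public void close() {
> >       processor.cleanUp();
> >     }
> >   }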
>
> I think this is valuable. A set of libraries that adapt a common
> programming model to various frameworks and simplify stream development
> is inherently cool. Write once, run anywhere.
>
> It’s a cool idea, but I’ve never successfully used it that way.  Also, as
> soon as you bring a batch framework like Pig, Spark, or Flink into your
> design, streams persisters quickly become irrelevant, because performance
> is usually better using the framework’s preferred libraries.  With
> streaming frameworks this is less pronounced, but there’s a trade-off to
> consider at every integration point, and I’ve found pretty much
> universally that the framework-maintained libraries tend to be faster.
>
> > How about the StreamBuilder interface? Does anyone still believe we
> > should support (and still want to work on) classes implementing
> > StreamBuilder to build and run a pipeline composed solely of streams
> > components on other frameworks? Personally I prefer to write code using
> > the framework APIs at the pipeline level, and embed individual streams
> > components at the step level.
>
> I think this could be valuable if done better. For instance, binding
> classes to steps in the stream pipeline, rather than instances. This
> would let the aforementioned adapter libraries configure components using
> the programming model declared by streams and set up pipelines in target
> systems.
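>
> Concretely, something like this (hypothetical overload; today's
> StreamBuilder.addStreamsProcessor only accepts constructed instances, and
> LinkExtractorProcessor is a made-up component name):
>
>   // today: the builder is handed a live, configured instance
>   builder.addStreamsProcessor(
>       "extract", new LinkExtractorProcessor(config), 1, "provider");
>
>   // instead: hand it the class plus configuration, and let the adapter
>   // library instantiate and configure the component wherever the target
>   // system schedules that step
>   builder.addStreamsProcessor(
>       "extract", LinkExtractorProcessor.class, config, 1, "provider");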
>
> It’s a cool idea, but I think down that path we’d wind up with pretty
> beefy runtimes, loss of clean separation between modules, and unwanted
> transitive dependencies.  For example, an HDFS persist reader embedded in
> a Spark pipeline should be interpreted as sc.textFile / sc.sequenceFile,
> or else spark.* properties that determine read behavior won’t be picked
> up.  An Elasticsearch persist writer embedded in a Flink pipeline should
> be interpreted to use
> org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSink, or
> maybe org.apache.flink.streaming.connectors.elasticsearch2.ElasticsearchSink,
> depending on what’s available on the class path.  Pretty soon each
> runtime becomes a crazy hard-to-test monolith, because it has to be in
> order to properly optimize how it interprets each component.  That’s my
> fear anyway, and it’s why I’m leaning more toward runtimes that don’t
> have a StreamBuilder at all, and just provide configuration support and
> static helper methods.
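>
> What I have in mind is closer to this (a sketch; StreamsSparkHelpers is
> hypothetical, and I’m assuming a reader configuration object with a
> getPath() accessor like the one in streams-persist-hdfs):
>
>   import org.apache.spark.api.java.JavaRDD;
>   import org.apache.spark.api.java.JavaSparkContext;
>   import org.apache.streams.hdfs.HdfsReaderConfiguration;
>
>   public final class StreamsSparkHelpers {
>
>     private StreamsSparkHelpers() {}
>
>     // translate the streams reader configuration into a native Spark
>     // read, so spark.* properties still govern the actual I/O
>     public static JavaRDD<String> openHdfsReader(
>         JavaSparkContext sc, HdfsReaderConfiguration config) {
>       return sc.textFile(config.getPath());
>     }
>   }
>
> No StreamBuilder, no lifecycle management, and no transitive dependencies
> beyond the framework the pipeline already uses.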
>
> > Any other thoughts on the topic?
> >
> > Steve
> >
> > - What should the focus be? If you look at the code, the project really
> > provides 3 things: (1) a stream processing engine and integration with
> > data persistence mechanisms, (2) a reference implementation of
> > ActivityStreams, AS schemas, and tools for interlinking activity
> > objects and events, and (3) a uniform API for integrating with social
> > network APIs. I don't think that first thing is needed anymore. Just
> > looking at Apache projects, NiFi, Apex + Apex Malhar, and to some
> > extent Flume are further along here. StreamSets covers some of this
> > too, and arguably Logstash also gets used for this sort of work. I.e.,
> > I think the project would be much stronger if it focused on (2) and (3)
> > and marrying those up to other Apache projects that fit (1). Minimally,
> > it needs to be de-entangled a bit.