Re: [DISCUSS] What to do with streams-runtime-local and other streams-runtimes modules

Ryan Ebanks Wed, 12 Oct 2016 11:56:02 -0700

Its been a long time since I contributed, but I would like to agree that
local runtime should be trashed/re-written.  Preferably in a akka
implementation that is lock free or as lock free as possible.  The current
one is not very good, and I say that knowing I wrote a lot of it.  I also
think it should be designed for local testing only and not for production
use.  There are too many other frameworks to use to justify the work needed
to fix/re-write it to a production standard.


On Wed, Oct 12, 2016 at 11:21 AM, sblackmon <sblack...@apache.org> wrote:

> On October 11, 2016 at 3:31:41 PM, Matt Franklin (m.ben.frank...@gmail.com)
> wrote:
> On Tue, Sep 27, 2016 at 6:05 PM sblackmon <sblack...@apache.org> wrote:
>
> > All,
> >
> >
> >
> > Joey brought this up over the weekend and I think a discussion is overdue
> > on the topic.
> >
> >
> >
> > Streams components were meant to be compatible with other runtime
> > frameworks all along, and for the most part are implemented in a manner
> > compatible with distributed execution where coordination, message
> passing,
> > and lifecycle and handled outside of streams libraries. By community
> > standards any component or component configuration object that doesn't
> > cleanly serializable for relocation in a distributed framework is a bug.
> >
>
> Agreed, though this could be more explicit.
>
> Some modules contain a unit test that checks for serializability of
> components.  Maybe we can find a way to systematize this such that every
> Provider, Processor, and Persister added to the code base gets run through
> a serializability check during mvn test.  We could try to catch up by
> adding similar tests throughout the code base and -1 new submissions that
> don’t include such a test, but that approach seems harder than doing
> something using Reflections.
>
>
> >
> >
> >
> > When the streams project got started in 2012 storm was the only TLP
> > real-time data processing framework at apache, but now there are plenty
> of
> > good choices all of which are faster and better tested than our
> > streams-runtime-local module.
> >
> >
>
> >
> > So, what should be the role of streams-runtime-local? Should we keep it
> > at all? The tests take forever to run and my organization has stopped
> > using it entirely. The best argument for keeping it is that it is useful
> > when integration testing small pipelines, but perhaps we could just agree
> > to use something else for that purpose?
> >
> >
> I think having a local runtime for testing or small streams is valuable,
> but there is a ton of work that needs to go into the current runtime.
>
> Yeah, the magnitude of that effort is why it might be worth considering
> starting from scratch.  We need a quality testing harness runtime at a
> minimum. local is suitable for that, barely.
>
>
> >
> >
> > Do we want to keep the other runtime modules around and continue adding
> > more? I’ve found that when embedding streams components in other
> > frameworks (spark and flink most recently) I end up creating a handful of
> > classes to help bind streams interfaces and instances within the pdfs /
> > functions / transforms / whatever are that framework atomic unit of
> > computation and reusing them in all my pipelines.
> >
> >
> I think this is valuable. A set of libraries that adapt a common
> programming model to various frameworks that simply stream development is
> inherently cool. Write once, run anywhere.
>
> It’s a cool idea, but I’ve never successfully used it that way.  Also as
> soon as you bring a batch framework like pig, spark, or flink into your
> design, streams persisters quickly become irrelevant because performance is
> usually better using the framework preferred libraries.  Streaming
> frameworks not as much but there’s a trade-off to consider with every
> integration point and I’ve found pretty much universally that the
> framework-maintained libraries tend to be faster.
>
>
> >
> >
> > How about the StreamBuilder interface? Does anyone still believe we
> > should support (and still want to work on) classes
> > implementing StreamBuilder to build and running a pipeline comprised
> solely
> > of streams components on other frameworks? Personally I prefer to write
> > code using the framework APIs at the pipeline level, and embed individual
> > streams components at the step level.
> >
> >
> I think this could be valuable if done better. For instance, binding
> classes to steps in the stream pipeline, rather than instances. This would
> let the aforementioned adapter libraries configure components using the
> programming model declared by streams and setup pipelines in target
> systems.
>
> It’s a cool idea, but i think down that path we’d wind up with pretty
> beefy runtimes, loss of clean separation between modules, and unwanted
> transitive dependencies.  For example, an hdfs persist reader embedded in a
> spark pipeline should be interpreted as sc.readTextFile / readSequenceFile,
> or else spark.* properties that determine read behavior won’t be picked
> up.  An elastic search persist writer embedded in a flink pipeline should
> be interpreted to use 
> org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSink.
> Or just maybe 
> org.apache.flink.streaming.connectors.elasticsearch2.ElasticsearchSink 
> depending
> on what’s available on the class path. Pretty soon each runtime becomes a
> crazy hard to test monolith because it has to be to properly optimize how
> it interprets each component.  That’s my fear anyway and it’s why I’m
> leaning more toward runtimes that don’t have a StreamBuilder at all, and
> just provide configuration support and static helper methods.
>
>
> >
> >
> > Any other thoughts on the topic?
> >
> >
> >
> > Steve
> >
> >
> >
> > - What should the focus be? If you look at the code, the project really
> > provides 3 things: (1) a stream processing engine and integration with
> data
> > persistence mechanisms, (2) a reference implementation of
> ActivityStreams,
> > AS schemas, and tools for interlinking activity objects and events, and
> (3)
> > a uniform API for integrating with social network APIs. I don't think
> that
> > first thing is needed anymore. Just looking at Apache projects, NiFi,
> Apex
> > + Apex Malhar, and to some extent Flume are further along here. Stream
> Sets
> > covers some of this too, and arguably Logstash also gets used for this
> sort
> > of work. I.e., I think the project would be much stronger if it focused
> on
> > (2) and (3) and marrying those up to other Apache projects that fit (1).
> > Minimally, it needs to be de-entangled a bit.
>

Re: [DISCUSS] What to do with streams-runtime-local and other streams-runtimes modules

Reply via email to