Fundamentally, processors as initially conceived do not fire events autonomously or maintain state between messages. Changing that paradigm would mean Pig/MR1 would no longer be capable of serving as a full-featured processor runtime. Agree this is limiting, but only in terms of what a processor can do, not what streams can do.
Steve Blackmon

> On May 14, 2014, at 2:02 PM, Ryan Ebanks <ryaneba...@gmail.com> wrote:
>
> I think a processor should solely be responsible for processing data. I
> think the interface describes exactly what a processor should do: get a
> piece of data and produce output data. Having it do more than that is
> expanding the functionality of a processor beyond its intent.
>
> I do agree that being able to fire off events for lack of data is
> necessary, but that should be the responsibility of the provider. The
> provider is the connection to the outside world and should be in charge
> of producing new data and firing off events when new data is not
> produced or is missing. The provider should know exactly when the last
> piece of new data entered the system, and so it will know when to send
> alerts. Everything downstream of the provider should be a closed flow
> with guarantees of processing, which most runtimes provide (though the
> local runtime needs a lot of love).
>
> Without a doubt, the runtime environments need to be improved and
> rigorously tested to make sure that once data enters a stream from a
> provider, no data is dropped and every datum completes the stream.
> However, I do not believe that changing the processor is the answer to
> that problem.
>
> Ryan Ebanks
>
>
> On Wed, May 14, 2014 at 3:12 PM, Matthew Hager [W2O Digital] <
> mha...@w2odigital.com> wrote:
>
>> Matt,
>>
>> As always, thanks for your feedback and mentorship as I work to
>> contribute to this project.
>>
>> I feel the current pattern for processors is extremely limiting given
>> the constraint of the return statement, rather than the alternative of
>> receive -> process -> write. It seems that if we revised the current
>> interfaces, we could handle many more use cases with streams that
>> cannot currently be accomplished.
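Ryan's suggestion above, that the provider (which knows exactly when the last piece of new data entered the system) should be the one firing stale-data alerts, could be sketched roughly like this. `StaleAwareProvider` and its methods are hypothetical names for illustration only, not part of the Streams API:

```java
// Hypothetical sketch: the provider tracks when it last produced a
// datum and reports staleness, so alerting lives at the edge of the
// stream rather than inside a processor.
class StaleAwareProvider {
    private final long maxQuietMillis; // longest acceptable gap between data
    private long lastDatumAt;          // timestamp of the most recent datum

    StaleAwareProvider(long maxQuietMillis, long startTime) {
        this.maxQuietMillis = maxQuietMillis;
        this.lastDatumAt = startTime;
    }

    // Called whenever the provider emits a new datum downstream.
    void recordDatum(long now) {
        lastDatumAt = now;
    }

    // Polled from a timer: true means no new data has arrived for too
    // long, and the provider should send its alerts (log, page, etc.).
    boolean isStale(long now) {
        return now - lastDatumAt > maxQuietMillis;
    }
}
```

Because the provider is the entry point to the stream, everything downstream of it can remain a closed flow, as Ryan describes.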
>>
>> Current "StreamsProcessor":
>>
>> -- Contract --
>> List<StreamsDatum> process(StreamsDatum datum);
>>
>> -- Drawbacks --
>> For the processor to reach its 'writer', it must have an item to
>> process. If it doesn't have an item, it can't write anything
>> downstream.
>>
>> -- Use Cases Not Satisfied --
>> In many cases the absence of data on a 'stream' is an event in itself.
>> For instance, suppose I want listeners at various points in the process
>> to ensure my data providers are working appropriately, and on stale
>> data I want to send out a message, log an event to Oracle, and alert
>> PagerDuty. I could write a quick processor that checks for stale data
>> and, upon recognizing it (a timer expiring), fires the proper events to
>> its writers.
>>
>> Other use cases involve debugging, stream summarization metrics,
>> trending algorithms, processing stats, and many others.
>>
>> +++ Proposed Interfaces +++
>>
>> // Analogous to Provider
>> public interface Provider {
>>     push(StreamsDatum);
>>     registerListener(Listener);
>> }
>>
>> // Analogous to Writer
>> public interface Listener {
>>     receive(StreamsDatum);
>> }
>>
>> // Analogous to Processor
>> public interface Processor extends Provider, Listener {
>> }
>>
>> +++ Conclusion +++
>> I feel the event-driven model is more flexible than the current model.
>> We can write an adapter for the legacy 'contract' to ease
>> implementation / transition.
>>
>> IE:
>>
>> interface LegacyProcessor {
>>     List<StreamsDatum> process(StreamsDatum datum);
>> }
>>
>> class LegacyProcessorImpl {
>>
>>     receive(StreamsDatum datum) {
>>         for (StreamsDatum d : this.process(datum)) {
>>             this.push(d);
>>         }
>>     }
>> }
>>
>> I really love streams. I think the concept is great, and I think adding
>> this would take streams from "great" --> "OMG, the whole world should
>> use this all the time for everything ever, it is flipping amazing and
>> better than ice cream on a hot sunny day."
>>
>> Thoughts?
>>
>> Thanks!
>> Smashew
>>
>>
>>> On 5/13/14, 2:38 PM, "Matt Franklin" <m.ben.frank...@gmail.com> wrote:
>>>
>>> On Tue, May 6, 2014 at 10:53 PM, Matthew Hager [W2O Digital] <
>>> mha...@w2odigital.com> wrote:
>>>
>>> <snip />
>>>
>>>> StreamsResultSet - I actually found this to be quite a useful
>>>> paradigm. A queue prevents a buffer overflow, an iterator makes it
>>>> fun and easy to read (I love iterators), and it is simple and
>>>> succinct. I do, however, feel it is best expressed as an interface
>>>> instead of a class.
>>>
>>> I agree with an interface. IMO, anything that is not a utility-style
>>> helper should be interacted with via its interface.
>>>
>>>> The thing missing from this, as an interface, would be the notion of
>>>> "isRunning", which could easily satisfy both of the aforementioned
>>>> modalities.
>>>
>>> Reasonable.
>>>
>>>> Event Driven - I concur with Matt 100% on this. As currently
>>>> implemented, LocalStreamsBuilder is exceedingly inefficient from a
>>>> memory perspective and a time-execution perspective. To me, it seems
>>>> that we could abstract out two common interfaces to make this happen:
>>>>
>>>> * Listener { receive(StreamsDatum); }
>>>> * Producer { push(StreamsDatum); registerListener(Listener); }
>>>>
>>>> Where the following implementations would apply:
>>>>
>>>> * Reader implements Producer
>>>> * Processor implements Producer, Listener
>>>> * Writer implements Listener
>>>
>>> Seems logical. I would like to see the two possible operating modes
>>> represented as distinct interfaces.
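The Listener/Producer pair discussed in the thread can be wired into a tiny event-driven pipeline to show the shape of the proposal. The class bodies below are illustrative assumptions (the thread only names the interfaces), not actual Streams code:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in for StreamsDatum.
class Datum {
    final Object document;
    Datum(Object document) { this.document = document; }
}

// The two interfaces proposed in the thread.
interface Listener {
    void receive(Datum datum);
}

interface Producer {
    void push(Datum datum);
    void registerListener(Listener listener);
}

// A processor sits in the middle: it listens upstream and produces
// downstream. Because output goes through push(), it could also emit
// with no input at all (e.g. from a timer), which the return-based
// process() contract cannot do.
class SplitProcessor implements Producer, Listener {
    private final List<Listener> listeners = new ArrayList<>();

    public void registerListener(Listener listener) {
        listeners.add(listener);
    }

    public void push(Datum datum) {
        for (Listener listener : listeners) {
            listener.receive(datum);
        }
    }

    public void receive(Datum datum) {
        // One datum in, many out: split on whitespace and push each.
        for (String word : datum.document.toString().split(" ")) {
            push(new Datum(word));
        }
    }
}

// A writer is just a listener; this one records what it sees.
class CollectingWriter implements Listener {
    final List<Object> seen = new ArrayList<>();

    public void receive(Datum datum) {
        seen.add(datum.document);
    }
}
```

Under this sketch a Reader would implement Producer the same way SplitProcessor does, and the runtime's wiring job reduces to calling registerListener to connect the graph.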