Joe - thanks for bumping this.

Bryan,

"What are the best practices for implementing a processor that needs to
maintain some kind of state?

I'm thinking of a processor that executes on a timer and pulls data from
somewhere, but needs to know where it left off for the next execution, and
I was hoping to not involve an external data store here."

The only managed state the framework provides is through FlowFile objects
and the passing of them between processors.  To keep persistent track of
what a given processor is doing, beyond what FlowFiles carry, you do need
to implement some state persistence mechanism yourself (to a file, to a
database, etc.).

One example of a processor that does this is the GetHTTP processor.  It
interacts with web services and in so doing needs to keep track of any
cache/ETag information it receives so it can be smart about pulling the
same resource or not, depending on whether the server indicates it has
changed.  The processor does this by saving off a file at
'conf/.httpCache-<<processor uuid>>'.  Using the processor uuid in the
name avoids conflicts with other processors of the same type and makes
finding the file on startup very easy.  If the file is there, the
processor uses it to recover state; if not, it starts a new one.
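
As a rough illustration of that approach (not the actual GetHTTP code), here
is a minimal sketch of a per-processor state file.  The class name
ProcessorStateFile, the ".state-" file prefix, and the use of
java.util.Properties as the format are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Properties;

// Hypothetical helper, not part of NiFi: keeps a processor's small state
// in a properties file named after the processor's id.
final class ProcessorStateFile {
    private final Path stateFile;

    ProcessorStateFile(Path directory, String processorId) {
        // The id in the file name keeps instances of the same processor
        // type from overwriting each other's state.
        this.stateFile = directory.resolve(".state-" + processorId);
    }

    // Returns the saved state, or empty Properties on first run
    // (no file yet).
    Properties load() throws IOException {
        Properties props = new Properties();
        if (Files.exists(stateFile)) {
            try (InputStream in = Files.newInputStream(stateFile)) {
                props.load(in);
            }
        }
        return props;
    }

    // Writes to a temporary sibling file first and then moves it into
    // place, which reduces the chance of leaving a half-written state file.
    void save(Properties props) throws IOException {
        Path tmp = stateFile.resolveSibling(stateFile.getFileName() + ".tmp");
        try (OutputStream out = Files.newOutputStream(tmp)) {
            props.store(out, "processor state");
        }
        Files.move(tmp, stateFile, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

A processor could call load() once when it is scheduled and save() after
each successful pull, storing something like a "lastOffset" property.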

That said, it is clearly desirable for the framework to offer some sort of
managed state mechanism for such simple cases.  We've talked about this
many times over the years but just never pulled the trigger because there
was always some aspect of our design ideas we didn't like.  So for right
now you'll need to implement state persistence like this outside the
framework.  But I've also kicked off a Jira for doing something about this
here: https://issues.apache.org/jira/browse/NIFI-259

What you were seeing in GetKafka and GetJMS processors was management of
state that involves interaction with their specific resources (Kafka,
JMS).  In the case of JMS it was a connection pooling type mechanism and in
the case of Kafka it was part of Kafka's stream iterator.  That is a
different thing than this managed persistent state you're asking about.
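
For reference, the borrow-and-return pattern described in the question can
be sketched roughly as below.  This is a simplified illustration, not the
GetJMS or GetKafka source; Client and PooledClientExample are made-up names.
Note that on a java.util.concurrent.BlockingQueue, poll() with no arguments
returns null immediately when the queue is empty; only the timed poll() and
take() wait.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for an expensive resource such as a JMS
// connection wrapper.
final class Client {
    private int uses = 0;

    String fetch() {
        uses++;
        return "message-" + uses;
    }
}

final class PooledClientExample {
    private final BlockingQueue<Client> pool = new LinkedBlockingQueue<>();

    PooledClientExample(int size) {
        for (int i = 0; i < size; i++) {
            pool.add(new Client());
        }
    }

    // Borrows a client, uses it, and always returns it to the pool.
    // Returns null if no client becomes free within the timeout.
    String trigger() throws InterruptedException {
        Client client = pool.poll(100, TimeUnit.MILLISECONDS);
        if (client == null) {
            return null;
        }
        try {
            return client.fetch();
        } finally {
            // Put the client back even if fetch() throws.
            pool.offer(client);
        }
    }

    int available() {
        return pool.size();
    }
}
```

With a pool of size one this also serializes access to the shared object,
but everything in it is lost on restart, which is why it does not cover the
persistent case.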

This is an important topic for us to communicate very well on.  Please feel
free to keep firing away until we've answered it fully.

Thanks
Joe

On Wed, Jan 14, 2015 at 5:06 AM, Joe Gresock <[email protected]> wrote:

> I'm also interested in the answers to Bryan's questions, if anyone has some
> input.
>
> Thanks,
> Joe
>
> On Fri, Jan 9, 2015 at 3:50 PM, Bryan Bende <[email protected]> wrote:
>
> > What are the best practices for implementing a processor that needs to
> > maintain some kind of state?
> >
> > I'm thinking of a processor that executes on a timer and pulls data from
> > somewhere, but needs to know where it left off for the next execution,
> > and I was hoping to not involve an external data store here.
> >
> > From looking at processors like GetJMS and GetKafka, I noticed the use of
> > BlockingQueue<> where poll() is called at the beginning of onTrigger(),
> > and then the object is put back in the queue in a finally block.
> >
> > As far as I could tell it looks like the intent was to only have one
> > object in the queue, and use the queue as the mechanism for synchronizing
> > access to the shared object, so that if another thread called onTrigger it
> > would block on poll() until the previous execution put the object back in
> > the queue.
> >
> > Is that the general approach?
> >
> > Thanks,
> >
> > Bryan
> >
>
>
>
> --
> I know what it is to be in need, and I know what it is to have plenty.  I
> have learned the secret of being content in any and every situation,
> whether well fed or hungry, whether living in plenty or in want.  I can do
> all this through him who gives me strength.    *-Philippians 4:12-13*
>
