Re: [DISCUSS] Run Once scheduling

Irizarry Jr., Nazario Thu, 12 Jan 2017 12:55:46 -0800

I think it is a matter of the model in one's head.  If one thinks in terms of a 
continuous activation paradigm, the green arrow versus the red square indicates 
what you point out.  On the other hand, in an ad-hoc run-once paradigm the green 
arrow is a nice, succinct indicator of what has not run yet.  In an analytics 
environment, processing can take minutes to hours for some processors.  As 
processing goes on, the processors with the remaining green arrows indicate what 
is left to complete in the “visual script.”
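
To make the run-once reading of the green arrow concrete, here is a rough 
sketch of the bookkeeping I have in mind.  It is only an illustration: the 
class and method names (RunOnceScript, select, completeOnce, remaining) are 
hypothetical and none of this is NiFi code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of a "visual script"; an assumption for illustration, not NiFi code. */
class RunOnceScript {
    // true  = still scheduled (green arrow)
    // false = stopped after running once (red square)
    private final Map<String, Boolean> scheduled = new LinkedHashMap<>();

    /** Mark a step as selected for the next run. */
    void select(String step) {
        scheduled.put(step, true);
    }

    /** Record that a step processed one input; under run-once it then stops. */
    void completeOnce(String step) {
        if (Boolean.TRUE.equals(scheduled.get(step))) {
            scheduled.put(step, false);
        }
    }

    /** Steps that still show the green arrow, i.e. what is left to run. */
    List<String> remaining() {
        List<String> left = new ArrayList<>();
        for (Map.Entry<String, Boolean> entry : scheduled.entrySet()) {
            if (entry.getValue()) {
                left.add(entry.getKey());
            }
        }
        return left;
    }
}
```

After selecting two steps and completing the first, remaining() lists only 
the second, which is exactly what the leftover green arrows would show.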


Consider the following example. Say there are five processors. The first 
processor, say A, makes a query and gets data.  Depending on what I know about 
today’s input to A the output should be directed to B1, B2, B3, or B4.  The B's 
are actually variations on a particular analytic algorithm and most of the time 
only one of them needs to be used.  On one day (based on external knowledge) I 
click on A and B1 and then the Start arrow.  On another day I modify the query, 
click on A and B2 and then click on the Start arrow, etc.  Clearly I could have 
four flows and I could start/stop entire flows.  But, as the number of 
processing stages increases and the number of processing alternatives increases 
at each stage the combinatorial growth makes distinct flows painful to manage.  
Sometimes it is easier to have one all-encompassing flow and then allow the 
analyst to shift-click the portions they want to invoke for the next “run.”
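
The growth is easy to put numbers on.  As a small sketch (the class and 
method names are made up for illustration): if one alternative is picked at 
each stage, the number of distinct flows is the product of the alternatives 
per stage, while a single all-encompassing flow only needs their sum in 
processors.

```java
/** Hypothetical helper for counting flows; an illustration only. */
class FlowCount {
    /** Distinct end-to-end flows when exactly one alternative is chosen per stage. */
    static long distinctFlows(int[] alternativesPerStage) {
        long total = 1;
        for (int n : alternativesPerStage) {
            // multiplyExact throws ArithmeticException instead of silently overflowing
            total = Math.multiplyExact(total, (long) n);
        }
        return total;
    }

    /** Processors needed if every alternative lives in one shared flow. */
    static int processorsInOneFlow(int[] alternativesPerStage) {
        int total = 0;
        for (int n : alternativesPerStage) {
            total += n;
        }
        return total;
    }
}
```

For the example above (A followed by one of B1 to B4) that is 4 separate 
flows versus 5 processors in one flow.  If, hypothetically, two more stages 
with four alternatives each were added, it would be 64 separate flows versus 
13 processors.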


Naz Irizarry
MITRE Corp.
617-893-0074



> On Jan 12, 2017, at 2:14 PM, Joe Witt <[email protected]> wrote:
> 
> Naz
> 
> The green arrow vs red square says "scheduled to execute" vs "not
> scheduled to execute".  For most processors, such as those which take
> input flow files from a connection, even if they're scheduled to run
> they're not going to be executed unless there is work to do (data
> sitting in the queue) and space available (on all destination
> relationships).  Because of this I'm suggesting to consider just
> leaving them all scheduled to execute even though they won't actually
> be doing anything most of the time.  The stats on each component tell
> you how many times it was actually invoked and how much data it
> processed, etc.  So you'll see that they're not doing anything most
> of the time.
> 
> You mentioned not wanting to have to do anything manual yet run once
> would be a manual construct, right?
> 
> I don't mean to suggest I'm closed off to the idea of a run once
> concept; I just really want to understand your use case better.
> 
> Thanks
> Joe
> 
> On Thu, Jan 12, 2017 at 2:11 PM, Irizarry Jr., Nazario <[email protected]> wrote:
>> Correction, that was the processor scheduler’s stopProcessor() method that 
>> needs to be invoked so the UI shows that the processor is stopped.
>> 
>> Naz Irizarry
>> MITRE Corp.
>> 617-893-0074
>> 
>> 
>> 
>>> On Jan 12, 2017, at 2:08 PM, Irizarry Jr., Nazario <[email protected]> wrote:
>>> 
>>> Yes, we found that to keep the UI in sync (make sure it looks stopped after 
>>> it runs once) the flow controller's stopProcessor() method has to be called.
>>> 
>>> Naz Irizarry
>>> MITRE Corp.
>>> 617-893-0074
>>> 
>>> 
>>> 
>>> On Jan 12, 2017, at 1:41 PM, Brandon DeVries 
>>> <[email protected]> wrote:
>>> 
>>> I think answering Joe's question is step one.  However, I was curious and
>>> tried something:
>>> 
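>>> public void onTrigger(...){
>>>  // Only do work while the processor's scheduled flag is set.
>>>  if (!isScheduled()) {
>>>   return;
>>>  }
>>>  FlowFile flowFile = session.get();
>>>  if (flowFile == null) {
>>>   return;
>>>  }
>>>  session.transfer(flowFile, REL_SUCCESS);
>>>  // Clear the flag so later triggers return early until a stop / start.
>>>  updateScheduledFalse();
>>> }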
>>> 
>>> This processes one FlowFile per "scheduling".  I.e., one FlowFile goes
>>> through, and you need to stop / start to get another.  I'm not 100% sure that
>>> nothing else would ever call the "built in" updateScheduled* methods, but
>>> worst case the processor could maintain its own flag.  Also, for what it's
>>> worth, calling updateScheduledFalse() doesn't "stop" the processor on the
>>> graph... as Oleg mentions, this (or something like this) could potentially
>>> be visually confusing.
>>> 
>>> I'm not sure how this fits in a production system, but this +
>>> GenerateFlowFile and some backpressure seems possibly useful for
>>> debugging.  I know I've faked this behavior with a GenerateFlowFile w/ run
>>> schedule "1 day" or something before...  then again, maybe it would be best
>>> to not create something that could be confusing / misused in a production
>>> system.
>>> 
>>> Brandon
>>> 
>>> 
>>> 
>>> 
>>> On Thu, Jan 12, 2017 at 1:02 PM Joe Witt 
>>> <[email protected]> wrote:
>>> 
>>> Naz,
>>> 
>>> Why not just leave all the processes running?  If the data only
>>> arrives periodically that is ok, right?
>>> 
>>> Thanks
>>> Joe
>>> 
>>> On Thu, Jan 12, 2017 at 10:54 AM, Irizarry Jr., Nazario 
>>> <[email protected]> wrote:
>>> On a project that I am on we have been looking at using NiFi for
>>> orchestrations that are invoked infrequently.  For example, once a month a
>>> new data input product becomes available and then one wants to run it
>>> through a set of processing steps that can be nicely implemented using NiFi
>>> processors.  However, using the interval or cron scheduling for this
>>> purpose begins to get cumbersome after a while with the need to start and
>>> manually stop these occasional flows.
>>> 
>>> It would be fairly easy to add an additional scheduling option - “Run
>>> Once” for this use case.  The behavior would be that when a processor is
>>> set to run once it automatically stops after it has successfully processed
>>> one input.
>>> 
>>> What do people think?  We are willing to implement this small
>>> enhancement.
>>> 
>>> Cheers,
>>> 
>>> Naz Irizarry
>>> MITRE Corp.
>>> 617-893-0074
>>> 
>> 
>
