Re: [DISCUSS] Run Once scheduling

Joe Witt Tue, 31 Jan 2017 07:29:04 -0800

Hello

You will first want to create a JIRA describing the work/idea being
done.  Then in the commit log be sure to reference NIFI-XXXX.


Take a look here for a helpful guide on how best to help the community
land contributions.

https://cwiki.apache.org/confluence/display/NIFI/Contributor+Guide

Thanks
Joe

On Tue, Jan 31, 2017 at 10:17 AM, Irizarry Jr., Nazario <[email protected]> wrote:
> I am about to submit a PR for an implementation of the run-once scheduling.  
> There is no outstanding JIRA ticket on this so what kind of NIFI-XXXX or 
> other labeling should I put into the title of the PR?
>
> Thanks,
>
> Naz Irizarry
> MITRE Corp.
> 617-893-0074
>
>
>
>> On Jan 12, 2017, at 3:55 PM, Irizarry Jr., Nazario <[email protected]> wrote:
>>
>> I think it is a matter of the model in one's head.  If one thinks in terms
>> of a continuous activation paradigm, the green arrow versus the red square
>> indicates what you point out.  On the other hand, in an ad-hoc run-once
>> paradigm the green arrow is a nice succinct indicator of what has not run
>> yet.  In an analytics environment processing can take minutes to hours for
>> some processors.  As processing goes on, the processors with the remaining
>> green arrows indicate what is left to complete in the “visual script.”
>>
>> Consider the following example.  Say there are five processors.  The
>> first processor, say A, makes a query and gets data.  Depending on what I
>> know about today’s input to A, the output should be directed to B1, B2, B3,
>> or B4.  The B's are actually variations on a particular analytic algorithm,
>> and most of the time only one of them needs to be used.  On one day (based
>> on external knowledge) I click on A and B1 and then the Start arrow.  On
>> another day I modify the query, click on A and B2, and then click on the
>> Start arrow, and so on.  Clearly I could have four flows and I could
>> start/stop entire flows.  But as the number of processing stages increases
>> and the number of processing alternatives increases at each stage, the
>> combinatorial growth makes distinct flows painful to manage.  Sometimes it
>> is easier to have one all-encompassing flow and then allow the analyst to
>> shift-click the portions they want to invoke for the next “run.”
>>
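
To put a number on the combinatorial growth described above, here is a small sketch. It assumes exactly one alternative is chosen at each stage, so the number of distinct flows is the product of the alternatives per stage. The class and method names are hypothetical and are not part of NiFi.

```java
// Hypothetical helper, not a NiFi class: counts how many separate flows
// would be needed if every combination of alternatives had its own flow.
final class FlowCountSketch {

    // alternativesPerStage[i] is the number of interchangeable processors
    // at stage i (assumption: exactly one is used per run).
    static long distinctFlows(int[] alternativesPerStage) {
        long total = 1;
        for (int n : alternativesPerStage) {
            // multiplyExact throws ArithmeticException instead of silently overflowing
            total = Math.multiplyExact(total, (long) n);
        }
        return total;
    }

    public static void main(String[] args) {
        // The example above: one query processor A, then one of B1..B4.
        System.out.println(distinctFlows(new int[] {1, 4}));       // 4
        // Two more stages with four alternatives each.
        System.out.println(distinctFlows(new int[] {1, 4, 4, 4})); // 64
    }
}
```

With a single all-encompassing flow the canvas holds the sum of the processors (1 + 4 + 4 + 4 = 13 in the second case) rather than 64 separate flows.
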
>>
>> Naz Irizarry
>> MITRE Corp.
>> 617-893-0074
>>
>>
>>
>>> On Jan 12, 2017, at 2:14 PM, Joe Witt <[email protected]> wrote:
>>>
>>> Naz
>>>
>>> The green arrow vs red square says "scheduled to execute" vs "not
>>> scheduled to execute".  For most processors, such as those which take
>>> input flow files from a connection, even if they're scheduled to run
>>> they're not going to be executed unless there is work to do (data
>>> sitting in the queue) and space available (on all destination
>>> relationships).  Because of this I'm suggesting to consider just
>>> leaving them all scheduled to execute even though they won't actually
>>> be doing anything most of the time.  The stats on each component tell
>>> you how many times it was actually invoked, how much data it
>>> processed, etc.  So you'll see that they're not doing anything most
>>> of the time.
>>>
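
The distinction drawn above between "scheduled" and "actually executed" can be illustrated with a toy model. This is only a simplified sketch of the rule as described in the message (scheduled, data waiting, space on every destination); the `Connection` type and `wouldExecute` method are hypothetical and do not reflect NiFi's real scheduler code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model, not NiFi code: when does a scheduled processor actually run?
final class TriggerCheckSketch {

    // Hypothetical stand-in for a queue between two processors.
    static final class Connection {
        final int queued;
        final int capacity;

        Connection(int queued, int capacity) {
            this.queued = queued;
            this.capacity = capacity;
        }

        boolean hasData()  { return queued > 0; }
        boolean hasSpace() { return queued < capacity; }
    }

    // True only if the processor is scheduled, has input waiting,
    // and every destination connection can accept more data.
    static boolean wouldExecute(boolean scheduled, Connection incoming,
                                List<Connection> destinations) {
        if (!scheduled || !incoming.hasData()) {
            return false;
        }
        for (Connection d : destinations) {
            if (!d.hasSpace()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Connection empty = new Connection(0, 100);
        Connection busy = new Connection(5, 100);
        Connection full = new Connection(100, 100);

        // Scheduled but idle: nothing in the queue, so no execution.
        System.out.println(wouldExecute(true, empty, Collections.singletonList(busy)));
        // Scheduled with work and room downstream: executes.
        System.out.println(wouldExecute(true, busy, Arrays.asList(empty, busy)));
        // Scheduled with work, but one destination is full: held back.
        System.out.println(wouldExecute(true, busy, Arrays.asList(empty, full)));
    }
}
```
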
>>> You mentioned not wanting to have to do anything manual, yet run once
>>> would be a manual construct, right?
>>>
>>> I don't mean to suggest I'm closed off to the idea of a run once
>>> concept; I just really want to understand your use case better.
>>>
>>> Thanks
>>> Joe
>>>
>>> On Thu, Jan 12, 2017 at 2:11 PM, Irizarry Jr., Nazario <[email protected]> wrote:
>>>> Correction, that was the processor scheduler’s stopProcessor() method that 
>>>> needs to be invoked so the UI shows that the processor is stopped.
>>>>
>>>> Naz Irizarry
>>>> MITRE Corp.
>>>> 617-893-0074
>>>>
>>>>
>>>>
>>>>> On Jan 12, 2017, at 2:08 PM, Irizarry Jr., Nazario <[email protected]> wrote:
>>>>>
>>>>> Yes, we found that to keep the UI in sync (make sure it looks stopped 
>>>>> after it runs once) the flow controller's stopProcessor() method has to 
>>>>> be called.
>>>>>
>>>>> Naz Irizarry
>>>>> MITRE Corp.
>>>>> 617-893-0074
>>>>>
>>>>>
>>>>>
>>>>> On Jan 12, 2017, at 1:41 PM, Brandon DeVries <[email protected]> wrote:
>>>>>
>>>>> I think answering Joe's question is step one.  However, I was curious and
>>>>> tried something:
>>>>>
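>>>>> public void onTrigger(...){
>>>>>     // Do nothing unless the processor is currently marked as scheduled.
>>>>>     if (!isScheduled()) {
>>>>>         return;
>>>>>     }
>>>>>     FlowFile flowFile = session.get();
>>>>>     if (flowFile == null) {
>>>>>         return;
>>>>>     }
>>>>>     session.transfer(flowFile, REL_SUCCESS);
>>>>>     // Clear the flag so later triggers are no-ops until a stop / start.
>>>>>     updateScheduledFalse();
>>>>> }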
>>>>>
>>>>> This processes one FlowFile per "scheduling".  I.e., one FlowFile goes
>>>>> through, and you need to stop / start to get another.  I'm not 100% sure
>>>>> that nothing else would ever call the "built in" updateScheduled* methods,
>>>>> but worst case the processor could maintain its own flag.  Also, for what
>>>>> it's worth, calling updateScheduledFalse() doesn't "stop" the processor on
>>>>> the graph... as Oleg mentions, this (or something like this) could
>>>>> potentially be visually confusing.
>>>>>
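
The "processor could maintain its own flag" fallback mentioned above might look roughly like the following. This is a self-contained simulation rather than a real NiFi processor: `RunOnceSketch`, `start()`, and the string-based queue are all hypothetical stand-ins, chosen so the gating logic can be run without the NiFi API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for a processor; not a NiFi class.
final class RunOnceSketch {
    // The processor's own flag, independent of any framework-managed state.
    private final AtomicBoolean armed = new AtomicBoolean(false);
    private final Queue<String> input = new ArrayDeque<>();
    private final List<String> success = new ArrayList<>();

    void enqueue(String flowFile) {
        input.add(flowFile);
    }

    // Models the user pressing Start: permits exactly one unit of work.
    void start() {
        armed.set(true);
    }

    // Models the framework triggering the processor repeatedly.
    // Returns true only when an item was actually processed.
    boolean onTrigger() {
        if (!armed.get()) {
            return false;              // already ran once since the last start
        }
        String flowFile = input.poll();
        if (flowFile == null) {
            return false;              // stay armed until input arrives
        }
        success.add(flowFile);         // "transfer to success"
        armed.set(false);              // disarm after one successful item
        return true;
    }

    List<String> success() {
        return success;
    }

    int queued() {
        return input.size();
    }
}
```

Calling `start()` and then `onTrigger()` several times moves a single item to `success()`; further items stay queued until the next `start()`. As noted in the thread, a flag like this does not by itself make the processor appear stopped in the UI.
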
>>>>> I'm not sure how this fits in a production system, but this +
>>>>> GenerateFlowFile and some backpressure seems possibly useful for
>>>>> debugging.  I know I've faked this behavior with a GenerateFlowFile w/ run
>>>>> schedule "1 day" or something before...  then again, maybe it would be
>>>>> best to not create something that could be confusing / misused in a
>>>>> production system.
>>>>>
>>>>> Brandon
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jan 12, 2017 at 1:02 PM Joe Witt <[email protected]> wrote:
>>>>>
>>>>> Naz,
>>>>>
>>>>> Why not just leave all the processes running?  If the data only
>>>>> arrives periodically that is ok, right?
>>>>>
>>>>> Thanks
>>>>> Joe
>>>>>
>>>>> On Thu, Jan 12, 2017 at 10:54 AM, Irizarry Jr., Nazario <[email protected]> wrote:
>>>>> On a project that I am on we have been looking at using NiFi for
>>>>> orchestrations that are invoked infrequently.  For example, once a month a
>>>>> new data input product becomes available and then one wants to run it
>>>>> through a set of processing steps that can be nicely implemented using
>>>>> NiFi processors.  However, using the interval or cron scheduling for this
>>>>> purpose begins to get cumbersome after a while with the need to start and
>>>>> manually stop these occasional flows.
>>>>>
>>>>> It would be fairly easy to add an additional scheduling option - “Run
>>>>> Once” for this use case.  The behavior would be that when a processor is
>>>>> set to run once it automatically stops after it has successfully processed
>>>>> one input.
>>>>>
>>>>> What do people think?  We are willing to implement this small
>>>>> enhancement.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Naz Irizarry
>>>>> MITRE Corp.
>>>>> 617-893-0074
>>>>>
>>>>
>>>
>>
>
