Re: EXT: Re: Re: [Discuss] Data prioritization - A proposed solution

Bryan Bende Fri, 01 Nov 2019 08:46:37 -0700

Jon,

I've added you to the contributors role in JIRA.


Thanks,

Bryan

On Fri, Nov 1, 2019 at 11:40 AM Kessler, Jon <[email protected]> wrote:
>
> I finally got around to creating this ticket: 
> https://issues.apache.org/jira/browse/NIFI-6831. May I please have 
> contributor status so that I can assign it to myself and submit a PR?
> ________________________________
> From: Kessler, Jon <[email protected]>
> Sent: Thursday, October 17, 2019 11:54 AM
> To: [email protected] <[email protected]>
> Subject: EXT: Re: Re: [Discuss] Data prioritization - A proposed solution
>
> Joe, hopefully I addressed all of your questions:
>
> Your interpretation of the scheduling aspect is correct. These queues will 
> pretend to be empty a certain % of the time if higher priority data recently 
> moved elsewhere. That % is configurable on a per rule basis which allows the 
> operator to determine how much to stagger the data associated with each rule. 
> That % is also how the rules are ranked in terms of order of priority. The 
> higher the %, the more often a rule will make use of its threads so the 
> higher its priority is considered to be.
>
> Administration: You are correct that the ruleset is provided at the flow 
> controller level but will be leveraged by all connections regardless of 
> whether or not they use the BucketPrioritizer (more details on this below). 
> This overall solution only works if all FlowFileQueues are of this new 
> implementation which is why I tied it to nifi.properties settings.
>
> The sorting function here takes place on insertion into any connection on 
> which a BucketPrioritizer is set. Once a FlowFile has been sorted into a 
> bucket we maintain that state so that each time it moves into a new 
> connection we already know in which bucket it should be placed without 
> needing to have a BucketPrioritizer set on that connection. Each bucket in a 
> connection is just a FIFO queue so no additional sorting is done. You should 
> only have to configure connections to use the BucketPrioritizer at points in 
> your flow where you believe you'll have enough information to accurately 
> determine priority but not beyond that point unless you want to re-evaluate 
> downstream for some reason. There is administration involved in setting these 
> BucketPrioritizers on some connections but it should be minimal per flow 
> (sometimes as few as one).
>
> Some additional information: When you delete a rule, each FlowFile that was 
> already associated with that rule will be re-evaluated against the ruleset 
> the next time it moves, when it enters the next connection, regardless of 
> whether or not a BucketPrioritizer is set on that connection. 
> Also FlowFiles that have yet to be evaluated (have yet to encounter a 
> BucketPrioritizer) will not be staggered. This was a design decision that if 
> we don't know what a priority is for a given FlowFile we should get it to 
> that point in the flow as soon as possible. This decision was a result of 
> empirical evidence that when we did stagger unevaluated data an incoming flow 
> of high priority data slowed its own upstream processing down once it was 
> identified and processed as high priority.
>
> Multi-tenancy: Agreed that a global priority list could be too restrictive 
> for multi-tenancy and should be addressed.
>
> Per swapping, this is an area where I admittedly need to put more thought 
> into my implementation because there is plenty of room for improvement. Right 
> now I'm just swapping files to disk in order of least to greatest priority 
> but they are all stored together. Therefore they're read back into memory in 
> order of least to greatest priority. More work should be done here.
>
>   - Jon
> ________________________________
> From: Joe Witt <[email protected]>
> Sent: Thursday, October 17, 2019 8:12:52 AM
> To: [email protected]
> Subject: EXT: Re: [Discuss] Data prioritization - A proposed solution
>
> Jon
>
> Probably some details I don't quite understand yet so responses here are to
> get there...
>
> The concept for scheduling is interesting.  Does this basically work around
> the fact that we have an unfair scheduler so this has queue implementations
> which pretend data is not available when it knows that there is higher
> priority data available elsewhere thus returning more threads to the pool
> faster to increase the likelihood that queues with higher priority data
> will get served more often?
>
> The notion of prioritization implies there is a sorting function happening
> somewhere.  NiFi now does sorting on insertion to every queue.  At what
> points are you suggesting sorting can be done/reduced to?
>
> Administration: The existing model does require each prioritizer to be set
> for each queue.  Yours does as well - to opt into this you'd have to select
> the BucketPrioritizer right?  It seems like you're saying the priority
> ruleset would be provided at the flow controller level and be enforced by
> all connections which leverage this prioritizer.  For large multi-tenant
> nifi flows having a global ruleset might be too limiting but maybe we just
> don't worry about that yet.
>
> How does this idea work with the fact that queues as they reach a given
> threshold have their data swapped out to disk and as data gets worked off
> the flowfiles get swapped back into memory?
>
> Thanks
> Joe
>
> On Thu, Oct 17, 2019 at 7:28 AM Kessler, Jon <[email protected]> wrote:
>
> > I want to start a discussion about a new prioritization mechanism that
> > addresses some of the issues that I believe exist in the current solution.
> > These issues are:
> >
> >  - Scheduling: No consideration is given to data priority when determining
> > which component is given the next available thread with which to work
> >  - Constant sorting: Because all flowfiles in a given connection share the
> > same PriorityQueue they must be sorted every time they move. While this
> > sort is efficient it can add up as queues grow deep.
> >  - Administration: There is a costly human element to managing the value
> > used as a priority ranking as priorities change. You must also ensure every
> > connection in the appropriate flow has the proper prioritizer assigned to
> > it to make use of the property.
> >
> > We have developed a prototype of a new FlowFileQueue implementation that
> > addresses these issues. Use of this implementation is controlled via
> > nifi.properties so you can opt-in or out system-wide without doing a lot of
> > configuration. Its design goals are:
> >
> >   - Instead of using the value of a FlowFile attribute as a ranking,
> > maintain a set of expression language rules to define your priorities. The
> > highest ranked rule that a given FlowFile satisfies will be that FlowFile's
> > priority
> >   - Because we have a finite set of priority rules we can utilize a bucket
> > sort in our connections. One bucket per priority rule. The bucket/rule with
> > which a FlowFile is associated will be maintained so that as it moves
> > through the system we do not have to re-evaluate that FlowFile against our
> > ruleset unless we have reason to do so.
> >   - Control where in your flow FlowFiles are evaluated against the ruleset
> > with a new Prioritizer implementation: BucketPrioritizer.
> >   - When this queue implementation is polled it will be able to check
> > state to see if any data of a higher priority than what it currently
> > contains recently (within 5s) moved elsewhere in the system. If higher
> > priority data has recently moved elsewhere, the connection will only
> > provide a FlowFile X% of the time where X is defined along with the rule.
> > This allows higher priority data to have more frequent access to threads
> > without thread-starving lower priority data.
> >   - Rules will be managed via a menu option for the flow and changes to
> > them take effect instantly. This allows you to change your priorities
> > without stopping/editing/restarting various components on the graph.
> >
> > I intend to contribute this solution but first want to solicit input and
> > opinions.
> >
> >   - Jon Kessler
> >
> >
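
The bucket sort described in the thread (a finite, ordered ruleset, one FIFO
bucket per rule, and a remembered bucket so a FlowFile is evaluated only once)
could be sketched roughly as below. This is only an illustration, not the
prototype's code: the class and method names are hypothetical, plain Java
predicates over an attribute map stand in for expression language rules, and
both the extra bucket for FlowFiles that match no rule and the
highest-bucket-first polling order are assumptions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch; none of these names come from NiFi or the prototype.
public class BucketQueueSketch {
    // Rules ordered from highest to lowest priority (stand-ins for EL rules).
    private final List<Predicate<Map<String, String>>> rules;
    // One FIFO bucket per rule, plus a last bucket for "matched no rule" (assumption).
    private final List<Deque<Map<String, String>>> buckets = new ArrayList<>();

    public BucketQueueSketch(List<Predicate<Map<String, String>>> rules) {
        this.rules = rules;
        for (int i = 0; i <= rules.size(); i++) {
            buckets.add(new ArrayDeque<>());
        }
    }

    // Done only where a prioritizer is set: index of the highest ranked rule satisfied.
    public int evaluate(Map<String, String> attributes) {
        for (int i = 0; i < rules.size(); i++) {
            if (rules.get(i).test(attributes)) {
                return i;
            }
        }
        return rules.size();
    }

    // Insertion with an already known bucket index: a plain append, no sorting.
    public void offer(Map<String, String> attributes, int bucket) {
        buckets.get(bucket).addLast(attributes);
    }

    // Serve the highest priority non-empty bucket first (assumption); FIFO within a bucket.
    public Map<String, String> poll() {
        for (Deque<Map<String, String>> bucket : buckets) {
            Map<String, String> next = bucket.pollFirst();
            if (next != null) {
                return next;
            }
        }
        return null;
    }
}
```

In this sketch a connection without a prioritizer would skip evaluate() and
call offer() with the bucket index carried over from the previous connection.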

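The scheduling behavior discussed above (a queue pretends to be empty some of
the time when higher priority data moved elsewhere within the last 5 seconds,
and otherwise provides data only X% of the time) might look something like the
following. Again this is a sketch under assumptions rather than the actual
implementation: every name is hypothetical, the shared "recently moved" state
is reduced to an array of timestamps indexed by rule rank, and the clock and
random source are injected only so the behavior can be checked
deterministically.

```java
import java.util.Arrays;
import java.util.function.DoubleSupplier;
import java.util.function.LongSupplier;

// Hypothetical sketch; none of these names come from NiFi or the prototype.
public class ThrottledPollSketch {
    // "Recently" is within 5 seconds, per the proposal.
    static final long WINDOW_MILLIS = 5_000L;

    // Last time data of each rank moved anywhere; -1 means never. Rank 0 is highest.
    private final long[] lastMovedMillis;
    // The configurable X for each rule, expressed as 0 to 100.
    private final double[] percent;
    private final LongSupplier clock;     // current time in milliseconds
    private final DoubleSupplier random;  // values in [0, 1)

    public ThrottledPollSketch(double[] percent, LongSupplier clock, DoubleSupplier random) {
        this.percent = percent.clone();
        this.lastMovedMillis = new long[percent.length];
        Arrays.fill(this.lastMovedMillis, -1L);
        this.clock = clock;
        this.random = random;
    }

    // Called whenever a FlowFile of the given rank moves between connections.
    public void recordMove(int rank) {
        lastMovedMillis[rank] = clock.getAsLong();
    }

    // Should a queue whose best data has this rank hand out a FlowFile right now?
    public boolean shouldProvide(int rank) {
        long now = clock.getAsLong();
        for (int higher = 0; higher < rank; higher++) {
            long moved = lastMovedMillis[higher];
            if (moved >= 0 && now - moved <= WINDOW_MILLIS) {
                // Higher priority data moved recently: provide only X% of the time.
                return random.getAsDouble() * 100.0 < percent[rank];
            }
        }
        // Nothing of higher priority moved recently, so never pretend to be empty.
        return true;
    }
}
```

Returning false here corresponds to the queue reporting itself as empty, which
hands the thread back to the pool without starving the lower priority rule
outright.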