Leveraging containerization sounds like a solid testing approach. It could be automated, which fits nicely with enterprise environments that might want to test configuration changes in an emulated environment before pushing them out to every device.
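For example, inside a disposable container a test harness is free to break things that would be far too risky on a real host. A minimal sketch of the kind of failure injection I have in mind (Linux-specific, assumes the test runs with CAP_SYS_ADMIN inside the container, and none of this is existing MiNiFi code -- the helper name and path are made up):

    // Hypothetical test helper: remount the repository volume read-only to
    // simulate a disk that has degraded to a read-only state.
    #include <sys/mount.h>

    #include <cerrno>
    #include <cstring>
    #include <iostream>

    bool SimulateReadOnlyDisk(const char* path) {
      // MS_REMOUNT | MS_RDONLY flips an existing mount to read-only in place.
      if (mount(nullptr, path, nullptr, MS_REMOUNT | MS_RDONLY, nullptr) != 0) {
        std::cerr << "remount failed: " << std::strerror(errno) << "\n";
        return false;
      }
      return true;
    }

    int main() {
      // Path is hypothetical: wherever the flow file repository volume is
      // mounted inside the test container.
      return SimulateReadOnlyDisk("/var/lib/minifi/flowfile_repo") ? 0 : 1;
    }

Because the container is disposable, the test can then assert that the agent detects the failure and degrades gracefully, and simply throw the environment away afterwards.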
And, yes, certainly more tickets would be welcome rather than piling this all onto MINIFI-356. Let me know if you want me to help put those together or collaborate on the design.

Thanks all,
Kevin

On 8/1/17, 11:09, "Marc" <[email protected]> wrote:

Hey Kevin,

These are really good points. I like the concepts laid out in number six. That helps solidify my belief that there is a greater scope of durability and reliability that is better captured in a series of tickets beyond the original ticket's intent. Certainly a good idea to take a page from the mobile platforms' playbook. I think a notification model ties into Andy's previous response RE sandboxing. I'm not immediately sure of the best way to tackle that, but I'll put some thoughts into a ticket.

Regarding testability: my thought was that we should leverage some of the work being done for containerization to help guide our testing. We can certainly make arbitrary test environments that set a file system into read-only mode, consume all memory in a queue, etc. Whether that is good enough remains to be seen. With our current unit tests and integration tests this is much more difficult to replicate, as opposed to a container where we have the freedom to 'break stuff'. I haven't fully scoped out what is needed for testability, so ideas are certainly welcome. Unfortunately my ideas/plans are in their infancy.

On Tue, Aug 1, 2017 at 10:56 AM, Kevin Doran <[email protected]> wrote:

> Hi Marc,
>
> Thanks for the write-up in email and on the linked JIRAs. I took a look just now and have some initial thoughts (a lot of this probably goes without saying):
>
> 1. I agree that partial failures (e.g., slower reads/writes, decreased network bandwidth, etc.) are hard to classify and should stay out of scope for now until we tackle complete failures (e.g., no disk, no network).
>
> 2. Logging and readme documentation will be important to assist troubleshooting / debugging. If an agent is configured to use a persistent repository, and it has degraded to a volatile repository, that could be really confusing to a novice user/admin who is trying to figure out how the agent is working. Therefore we need to make sure changes to agent behavior that occur as part of continuing operations are logged at some level.
>
> 3. Have you given any thought to testability? Forcing environments that would trigger failover capabilities will be difficult, both for developers implementing those capabilities and admins / operations folks who want to test their configurations before deploying them.
>
> 4. I think in a lot of cases, graceful degradation / continued operation of the MiNiFi agent will be desirable. However, if we go with that, the corresponding controls over the "bounds of the client", as you put it, are key (e.g., a configuration option for repositories that specifies a failover repository and the parameters for when to fail over).
>
> 5. In terms of utilization caps, I think we should definitely have them, and make them configurable where possible. I guess this is another way to express the bounds of the client, e.g., "do whatever you need to keep running, but never use more than XX MB of memory". Disk/memory footprints of persistent/volatile repositories are probably easy ones to start with. There should be default/built-in prioritizers for deciding which flow files to drop when the cap is reached, and over time we can make that extensible.
> I think this is in line with Joe's comment on the JIRA [1] that data from different sensors will likely have different importance and we need a way to deal with that. At the end of the day, if a flow is failing, but inputs are still coming in, and the agent has a utilization cap... something has to be dropped.
>
> 6. There might be some concepts from the mobile platform space that we could carry over to the design of the agent. For example, on iOS, the OS is able to send lots of signals to apps regarding what is happening at the platform level, and the app can be implemented to act appropriately in different scenarios -- for example, a memory warning, on which apps are supposed to dispose of any volatile resources that are nonessential or can be recreated, or a signal that the app is about to enter a background state. Maybe there are some good designs that can be carried over so custom processors have push/pull hooks into the state of the platform, provided by the framework. E.g., maybe a processor wants conditional logic based on the state of memory or network I/O, and the MiNiFi framework has APIs that make that discoverable (pull); perhaps all custom processors can also implement an interface that allows them to receive notifications from the framework when it detects some of these partial / complete failure conditions or is approaching configured utilization caps (push).
>
> I've watched both JIRAs and will follow this thread as well. I'll chime in with more after I have time to think about this and as more people respond. I agree that input from people with experience in the field would be really useful here.
>
> Kevin
>
> [1] https://issues.apache.org/jira/browse/MINIFI-356?focusedCommentId=16108832#comment-16108832
>
> On 8/1/17, 09:59, "Marc" <[email protected]> wrote:
>
> Good Morning,
>
> I've begun capturing some details in a ticket for durability and reliability of MiNiFi C++ clients [1]. The scope of this ticket is continuing operations despite failure within specific components. There is a linked ticket [2] that attempts to address some of the concerns brought up in MINIFI-356, focusing on memory usage.
>
> The spirit of the ticket was to capture conditions of known failure; however, given that more discussion has blossomed, I'd like to assess the experience of the mailing list. Continuing operations in any environment is difficult, particularly one in which we likely have little to no control. Simply gathering information to know when a failure is occurring is a major part of the battle. Per the tickets, there needs to be some discussion of how we classify failure.
>
> The ticket addressed the low-hanging fruit, but there are certainly more conditions of failure. If a disk switches to read-only mode, becomes full, and/or runs out of inode entries, etc., we know a complete failure has occurred and can thus switch our write activity to a volatile repository. I recognize that partial failures may occur, but how do we classify these? Should we classify them at all, or would that be venturing down a rabbit hole?
>
> For memory, we can likely throttle queue sizes as needed. For networking and other components we could likely find other measures of failure. The goal, no matter the component, is to continue operations without human intervention -- with the hope that the configuration makes the bounds of the client obvious.
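A rough sketch of that persistent-to-volatile failover idea, in C++ for concreteness -- the Repository interface and every name below are hypothetical, not the actual MiNiFi C++ repository API:

    // Hypothetical sketch only; illustrates degrading to a volatile store
    // after the persistent store reports a complete failure.
    #include <memory>
    #include <string>

    class Repository {
     public:
      virtual ~Repository() = default;
      // Returns false on complete failure (disk read-only, disk full, ...).
      virtual bool Put(const std::string& key, const std::string& value) = 0;
    };

    class FailoverRepository : public Repository {
     public:
      FailoverRepository(std::unique_ptr<Repository> persistent,
                         std::unique_ptr<Repository> volatile_fallback)
          : persistent_(std::move(persistent)),
            fallback_(std::move(volatile_fallback)) {}

      bool Put(const std::string& key, const std::string& value) override {
        if (!degraded_) {
          if (persistent_->Put(key, value)) return true;
          // Complete failure detected: switch write activity to the volatile
          // repository; real code should also log the degradation so it is
          // visible to operators.
          degraded_ = true;
        }
        return fallback_->Put(key, value);
      }

     private:
      std::unique_ptr<Repository> persistent_;
      std::unique_ptr<Repository> fallback_;
      bool degraded_ = false;
    };

The interesting design questions all live in the policy around degraded_: when (if ever) to probe the persistent store again, and how a configured memory cap bounds the volatile fallback.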
>
> My gut reaction is to separate out partial failure, as the low-hanging fruit of complete failure is much easier to address, but I would love to hear the reaction of this list. Further, any input on the types of failures to address would be appreciated. I look forward to any and all responses.
>
> Best Regards,
> Marc
>
> [1] https://issues.apache.org/jira/browse/MINIFI-356
> [2] https://issues.apache.org/jira/browse/MINIFI-360
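Finally, to make the push-style hooks from Kevin's point 6 a bit more concrete, a minimal sketch -- again, every name here is hypothetical rather than an existing MiNiFi C++ interface:

    // Hypothetical sketch of push-style platform-state hooks for processors;
    // not an existing MiNiFi C++ API.
    #include <vector>

    enum class PlatformCondition {
      kMemoryPressure,     // approaching a configured memory utilization cap
      kDiskReadOnly,       // repository disk degraded to read-only
      kNetworkUnavailable  // complete network failure detected
    };

    // Custom processors implement this to react to framework notifications,
    // e.g. by dropping recreatable caches on kMemoryPressure, much like iOS
    // apps do on a memory warning.
    class PlatformStateListener {
     public:
      virtual ~PlatformStateListener() = default;
      virtual void OnPlatformCondition(PlatformCondition condition) = 0;
    };

    // Framework side: detects conditions and pushes them to listeners.
    class PlatformMonitor {
     public:
      void Register(PlatformStateListener* listener) {
        listeners_.push_back(listener);
      }
      void Notify(PlatformCondition condition) {
        for (auto* listener : listeners_) {
          listener->OnPlatformCondition(condition);
        }
      }
     private:
      std::vector<PlatformStateListener*> listeners_;
    };

A pull-style counterpart would simply be query methods on the monitor (e.g., current memory utilization) that a processor's conditional logic can call on demand.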
