Hey Kevin,

These are really good points. I like the concepts laid out in number six. They solidify my belief that durability and reliability have a broader scope than the original ticket's intent, and are better captured in a series of tickets. It's certainly a good idea to take a page from the mobile platforms' playbook. I think a notification model also ties into Andy's previous response re: sandboxing. I'm not immediately sure of the best way to tackle that, but I'll put some thoughts into a ticket.
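To seed that ticket, here is the rough shape of what I'm picturing for the push side. To be clear, this is a hypothetical sketch -- none of these names (ResourceEvent, ResourceEventListener, etc.) exist in the codebase today:

    // Hypothetical sketch only -- none of these types exist in MiNiFi yet.
    #include <iostream>

    // Conditions the framework could detect and broadcast, loosely modeled
    // on the signals iOS sends to apps.
    enum class ResourceEvent {
      MemoryPressure,     // approaching a configured memory cap
      DiskReadOnly,       // a repository volume was remounted read-only
      DiskFull,           // out of blocks or inodes
      NetworkUnavailable  // upstream peer unreachable
    };

    // The "push" hook: processors opt in by implementing this interface
    // and registering with the framework.
    class ResourceEventListener {
     public:
      virtual ~ResourceEventListener() = default;
      virtual void onResourceEvent(ResourceEvent event) = 0;
    };

    // Example: a processor that sheds recreatable state under memory
    // pressure, much like an iOS app responding to a memory warning.
    class CachingProcessor : public ResourceEventListener {
     public:
      void onResourceEvent(ResourceEvent event) override {
        if (event == ResourceEvent::MemoryPressure) {
          std::cout << "dropping non-essential, recreatable caches\n";
        }
      }
    };

    int main() {
      CachingProcessor processor;
      // In practice the framework, not main(), would deliver this event.
      processor.onResourceEvent(ResourceEvent::MemoryPressure);
    }

The pull side would be the complement: an accessor on the framework that lets a processor query current memory/disk/network state on demand.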
Regarding testability: my thought was that we should leverage some of the work being done for containerization to help guide our testing. We can certainly build arbitrary test environments that set a file system to read-only mode, consume all memory in a queue, etc. Whether that is good enough remains to be seen. With our current unit and integration tests this is much more difficult to replicate than in a container, where we have the freedom to 'break stuff'. I haven't fully scoped out what is needed for testability, so ideas are certainly welcome. Unfortunately my ideas/plans are in their infancy.
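To sketch the kind of assertion I would want such a test to make (every class name here is made up for illustration; none of this is existing MiNiFi code):

    // Hypothetical sketch: models the behavior a containerized test would
    // assert -- when the persistent repository can no longer write (e.g.
    // the volume went read-only), the agent degrades to a volatile
    // repository and logs the transition so an admin can see what happened.
    #include <cassert>
    #include <iostream>
    #include <string>
    #include <vector>

    struct Repository {
      virtual ~Repository() = default;
      virtual bool put(const std::string& flowFile) = 0;
    };

    // Stands in for a filesystem-backed repo on a volume gone read-only.
    struct ReadOnlyDiskRepo : Repository {
      bool put(const std::string&) override { return false; }
    };

    // In-memory fallback.
    struct VolatileRepo : Repository {
      std::vector<std::string> entries;
      bool put(const std::string& flowFile) override {
        entries.push_back(flowFile);
        return true;
      }
    };

    // Degrades from primary to fallback on the first write failure and
    // logs once. For simplicity it never recovers back to the primary.
    struct FailoverRepo : Repository {
      Repository& primary;
      Repository& fallback;
      bool degraded = false;
      FailoverRepo(Repository& p, Repository& f) : primary(p), fallback(f) {}
      bool put(const std::string& flowFile) override {
        if (!degraded && primary.put(flowFile)) return true;
        if (!degraded) {
          std::cerr << "WARN: persistent repo failed; degrading to volatile repo\n";
          degraded = true;
        }
        return fallback.put(flowFile);
      }
    };

    int main() {
      ReadOnlyDiskRepo disk;
      VolatileRepo memory;
      FailoverRepo repo(disk, memory);
      assert(repo.put("flowfile-1"));      // write transparently fails over
      assert(memory.entries.size() == 1);  // data landed in the fallback
      std::cout << "failover test passed\n";
    }

The container's job would simply be to produce the real-world analogue of ReadOnlyDiskRepo -- for example, an actual volume remounted read-only underneath a running agent -- which is exactly the 'break stuff' freedom we don't get in our current unit and integration tests.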
On Tue, Aug 1, 2017 at 10:56 AM, Kevin Doran <[email protected]> wrote:

> Hi Marc,
>
> Thanks for the write-up in email and on the linked JIRAs. I took a look just now and have some initial thoughts (a lot of this probably goes without saying):
>
> 1. I agree that partial failures (e.g., slower reads/writes, decreased network bandwidth, etc.) are hard to classify and should stay out of scope for now until we tackle complete failures (e.g., no disk, no network).
>
> 2. Logging and readme documentation will be important to assist troubleshooting/debugging. If an agent is configured to use a persistent repository, and it has degraded to a volatile repository, that could be really confusing to a novice user/admin who is trying to figure out how the agent is working. Therefore we need to make sure changes to agent behavior that occur as part of continuing operations are logged at some level.
>
> 3. Have you given any thought to testability? Forcing environments that would trigger failover capabilities will be difficult, both for developers implementing those capabilities and for admins/operations folks who want to test their configurations before deploying them.
>
> 4. I think in a lot of cases, graceful degradation / continued operation of the MiNiFi agent will be desirable. However, if we go with that, the corresponding controls over the "bounds of the client", as you put it, are key (e.g., a configuration option for repositories that specifies a failover repository and the parameters for when to fail over).
>
> 5. In terms of utilization caps, I think we should definitely have them, and make them configurable where possible. I guess this is another way to express the bounds of the client, e.g., "do whatever you need to keep running, but never use more than XX MB of memory". Disk/memory footprints of persistent/volatile repositories are probably easy ones to start with. There should be default/built-in prioritizers for deciding which flow files to drop when the cap is reached, and over time we can make that extensible. I think this is in line with Joe's comment on the JIRA [1] that data from different sensors will likely have different importance and we need a way to deal with that. At the end of the day, if a flow is failing, but inputs are still coming in, and the agent has a utilization cap... something has to be dropped.
>
> 6. There might be some concepts from the mobile platform space that we could carry over to the design of the agent. For example, on iOS, the OS is able to send lots of signals to apps regarding what is happening at the platform level, and the app can be implemented to act appropriately in different scenarios. For example, a memory warning, for which apps are supposed to dispose of any volatile resources that are nonessential or can be recreated, or a signal that the app is about to enter a background state. Maybe there are some good designs that can be carried over so custom processors have push/pull hooks into the state of the platform that is provided by the framework. E.g., maybe a processor wants to have conditional logic based on the state of memory or network I/O and the MiNiFi framework has APIs that make that discoverable (pull), and perhaps all custom processors can implement an interface that allows them to receive notifications from the framework when it detects some of these partial/complete failure conditions or is approaching configured utilization caps (push).
>
> I've watched both JIRAs and will follow this thread as well. I'll chime in with more after I have time to think about this more and as more people respond. I agree input from people with experience from the field would be really useful here.
>
> Kevin
>
> [1] https://issues.apache.org/jira/browse/MINIFI-356?focusedCommentId=16108832#comment-16108832
>
> On 8/1/17, 09:59, "Marc" <[email protected]> wrote:
>
> Good Morning,
>
> I've begun capturing some details in a ticket for durability and reliability of MiNiFi C++ clients [1]. The scope of this ticket is continuing operations despite failure within specific components. There is a linked ticket [2] that attempts to address some of the concerns brought up in MINIFI-356, focusing on memory usage.
>
> The spirit of the ticket was to capture conditions of known failure; however, given that more discussion has blossomed, I'd like to assess the experience of the mailing list. Continuing operations in any environment is difficult, particularly one in which we likely have little to no control. Simply gathering information to know when a failure is occurring is a major part of the battle. Per the tickets, there needs to be some discussion of how we classify failure.
>
> The ticket addressed the low hanging fruit, but there are certainly more conditions of failure. If a disk switches to read-only mode, becomes full, and/or runs out of inode entries, etc., we know a complete failure has occurred and thus can switch our type of write activity to use a volatile repo. I recognize that partial failures may occur, but how do we classify these? Should we classify these at all, or would this be venturing into a rabbit hole?
>
> For memory we can likely throttle queue sizes as needed. For networking and other components we could likely find other measures of failure. The goal, no matter the component, is to continue operations without human intervention -- with the hope that the configuration makes the bounds of the client obvious.
>
> My gut reaction is to separate out partial failure, as the low hanging fruit of complete failure is much easier to address, but I would love to hear the reaction of this list. Further, any input on the types of failures to address would be appreciated. I look forward to any and all responses.
>
> Best Regards,
> Marc
>
> [1] https://issues.apache.org/jira/browse/MINIFI-356
> [2] https://issues.apache.org/jira/browse/MINIFI-360
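P.S. One concrete follow-up on classifying complete disk failure, from my original note quoted above: on POSIX systems the read-only, full, and out-of-inodes cases are all observable via statvfs(3). A rough sketch -- statvfs is real POSIX API, but the helper name and the zero-threshold checks are mine, and a real implementation would also need to handle mid-write errors like ENOSPC:

    #include <sys/statvfs.h>
    #include <cstdio>

    // Returns true if the volume backing 'path' can no longer accept
    // writes: remounted read-only, out of blocks, or out of inodes.
    bool diskWriteFailureLikely(const char* path) {
      struct statvfs vfs{};
      if (statvfs(path, &vfs) != 0) {
        std::perror("statvfs");  // the path itself is gone -- treat as failure
        return true;
      }
      if (vfs.f_flag & ST_RDONLY) return true;  // volume is read-only
      if (vfs.f_bavail == 0) return true;       // no free blocks left
      if (vfs.f_favail == 0) return true;       // no free inodes left
      return false;
    }

    int main() {
      // Check the volume backing the current directory as an example.
      std::printf("write failure likely: %d\n", diskWriteFailureLikely("."));
    }

An agent could poll this cheaply on the repository path and use it to trigger the switch to a volatile repo before writes start failing.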
