Good Morning,

I've begun capturing some details in a ticket for durability and reliability of MiNiFi C++ clients [1]. The scope of this ticket is continuing operations despite failure within specific components. A linked ticket [2] attempts to address some of the concerns brought up in MINIFI-356, focusing on memory usage.
The spirit of the ticket was to capture conditions of known failure; however, given that more discussion has blossomed, I'd like to draw on the experience of the mailing list. Continuing operations in any environment is difficult, particularly one over which we likely have little to no control. Simply gathering enough information to know when a failure is occurring is a major part of the battle.

Per the tickets, there needs to be some discussion of how we classify failure. The tickets address the low-hanging fruit, but there are certainly more conditions of failure. If a disk switches to read-only mode, fills up, and/or runs out of inode entries, we know a complete failure has occurred and can switch our write activity to a volatile repo. I recognize that partial failures may occur, but how do we classify these? Should we classify them at all, or would that be venturing into a rabbit hole?

For memory we can likely throttle queue sizes as needed. For networking and other components we could likely find other measures of failure. The goal, no matter the component, is to continue operations without human intervention -- with the hope that the configuration makes the bounds of the client obvious.

My gut reaction is to separate out partial failure, as the low-hanging fruit of complete failure is much easier to address, but I would love to hear the reaction of this list. Further, any input on the types of failures to address would be appreciated.

Look forward to any and all responses.

Best Regards,
Marc

[1] https://issues.apache.org/jira/browse/MINIFI-356
[2] https://issues.apache.org/jira/browse/MINIFI-360
