I feel that MINIFI-356 is key to all of this. When I think of jagged-edge use cases that lack connectivity for days but have large mass storage devices, this feels really limiting. Of the variety of devices I have tested with thus far, most have only a single storage media mount. RasPis seem to be the de facto typical entry-level IoT device right now, and most use only the single media mount. To me this represents an area where even operating in degraded mode won't help us, as the OS will eventually fail on its own without its disk.

With that said, is it more valuable to use the storage media we have initially than it is to find a way to run without it? No doubt there are other scenarios where this is very useful, and I see more of them initially in the 'non-jagged' space. For example, a factory line PC within the enterprise network is always connected; it may never experience backpressure solely because it can send data as fast as it collects it. If we assume that the OS disks and repo disks are not the same, and the repo did fail, there would be value in continuing to operate, collecting and sending data. For all intents and purposes we don't care about backpressure here, because we can still send data as fast as it's collected.

~~Kevin's Responses

2. Logging and readme documentation will be important to assist troubleshooting / debugging. If an agent is configured to use a persistent repository, and it has degraded to a volatile repository, that could be really confusing to a novice user/admin who is trying to figure out how the agent is working. Therefore we need to make sure changes to agent behavior that occur as part of continuing operations are logged at some level. I would also expect that initially it's default off and has to be manually enabled.

3. Testing
Just initially thinking, I can re-use a RasPi but attach an eSATA drive. A hard failure from removing the drive itself, or unmounting it at the OS level, may do this, while leaving the OS drive (SD card) still plugged in.

On Tue, Aug 1, 2017 at 9:59 AM, Marc <[email protected]> wrote:

> Good Morning,
>
> I've begun capturing some details in a ticket for durability and
> reliability of MiNiFi C++ clients [1]. The scope of this ticket is
> continuing operations despite failure within specific components. A
> linked ticket [2] attempts to address some of the concerns brought up
> in MINIFI-356, focusing on memory usage.
>
> The spirit of the ticket was meant to capture conditions of known
> failure; however, given that more discussion has blossomed, I'd like to
> assess the experience of the mailing list. Continuing operations in any
> environment is difficult, particularly one in which we likely have
> little to no control. Simply gathering information to know when a
> failure is occurring is a major part of the battle. According to the
> tickets, there needs to be some discussion of how we classify failure.
>
> The ticket addressed the low hanging fruit, but there are certainly
> more conditions of failure. If a disk switches to read-only mode,
> becomes full, and/or runs out of inode entries, etc., we know a
> complete failure occurred and thus can switch our type of write
> activity to use a volatile repo. I recognize that partial failures may
> occur, but how do we classify these? Should we classify these at all or
> would this be venturing into a rabbit hole?
>
> For memory we can likely throttle queue sizes as needed. For networking
> and other components we could likely find other measures of failure.
> The goal, no matter the component, is to continue operations without
> human intervention -- with the hope that the configuration makes the
> bounds of the client obvious.
>
> My gut reaction is to separate partial failure, as the low hanging
> fruit of complete failure is much easier to address, but would love to
> hear the reaction of this list. Further, any input on the types of
> failures to address would be appreciated. Look forward to any and all
> responses.
>
> Best Regards,
> Marc
>
> [1] https://issues.apache.org/jira/browse/MINIFI-356
> [2] https://issues.apache.org/jira/browse/MINIFI-360
>

--
Joseph
