Hey Kevin,

These are really good points. I like the concepts laid out in number six. They solidify my belief that durability and reliability have a broader scope than the original ticket's intent, and are better captured in a series of tickets. It's certainly a good idea to take a page from the mobile platforms' playbook. I think a notification model also ties into Andy's previous response re: sandboxing. I'm not immediately sure of the best way to tackle that, but I'll put some thoughts into a ticket.
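To seed that ticket, here is the rough shape of what I'm picturing for the push side. To be clear, this is a hypothetical sketch -- none of these names (ResourceEvent, ResourceEventListener, etc.) exist in the codebase today:

    // Hypothetical sketch only -- none of these types exist in MiNiFi yet.
    #include <iostream>

    // Conditions the framework could detect and broadcast, loosely modeled
    // on the signals iOS sends to apps.
    enum class ResourceEvent {
      MemoryPressure,     // approaching a configured memory cap
      DiskReadOnly,       // a repository volume was remounted read-only
      DiskFull,           // out of blocks or inodes
      NetworkUnavailable  // upstream peer unreachable
    };

    // The "push" hook: processors opt in by implementing this interface
    // and registering with the framework.
    class ResourceEventListener {
     public:
      virtual ~ResourceEventListener() = default;
      virtual void onResourceEvent(ResourceEvent event) = 0;
    };

    // Example: a processor that sheds recreatable state under memory
    // pressure, much like an iOS app responding to a memory warning.
    class CachingProcessor : public ResourceEventListener {
     public:
      void onResourceEvent(ResourceEvent event) override {
        if (event == ResourceEvent::MemoryPressure) {
          std::cout << "dropping non-essential, recreatable caches\n";
        }
      }
    };

    int main() {
      CachingProcessor processor;
      // In practice the framework, not main(), would deliver this event.
      processor.onResourceEvent(ResourceEvent::MemoryPressure);
    }

The pull side would be the complement: an accessor on the framework that lets a processor query current memory/disk/network state on demand.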
Regarding testability: my thought was that we should leverage some of the work being done for containerization to help guide our testing. We can certainly build arbitrary test environments that set a file system to read-only mode, consume all memory in a queue, etc. Whether that is good enough remains to be seen. With our current unit and integration tests this is much more difficult to replicate than in a container, where we have the freedom to 'break stuff'. I haven't fully scoped out what is needed for testability, so ideas are certainly welcome. Unfortunately my ideas/plans are in their infancy.
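To sketch the kind of assertion I would want such a test to make (every class name here is made up for illustration; none of this is existing MiNiFi code):

    // Hypothetical sketch: models the behavior a containerized test would
    // assert -- when the persistent repository can no longer write (e.g.
    // the volume went read-only), the agent degrades to a volatile
    // repository and logs the transition so an admin can see what happened.
    #include <cassert>
    #include <iostream>
    #include <string>
    #include <vector>

    struct Repository {
      virtual ~Repository() = default;
      virtual bool put(const std::string& flowFile) = 0;
    };

    // Stands in for a filesystem-backed repo on a volume gone read-only.
    struct ReadOnlyDiskRepo : Repository {
      bool put(const std::string&) override { return false; }
    };

    // In-memory fallback.
    struct VolatileRepo : Repository {
      std::vector<std::string> entries;
      bool put(const std::string& flowFile) override {
        entries.push_back(flowFile);
        return true;
      }
    };

    // Degrades from primary to fallback on the first write failure and
    // logs once. For simplicity it never recovers back to the primary.
    struct FailoverRepo : Repository {
      Repository& primary;
      Repository& fallback;
      bool degraded = false;
      FailoverRepo(Repository& p, Repository& f) : primary(p), fallback(f) {}
      bool put(const std::string& flowFile) override {
        if (!degraded && primary.put(flowFile)) return true;
        if (!degraded) {
          std::cerr << "WARN: persistent repo failed; degrading to volatile repo\n";
          degraded = true;
        }
        return fallback.put(flowFile);
      }
    };

    int main() {
      ReadOnlyDiskRepo disk;
      VolatileRepo memory;
      FailoverRepo repo(disk, memory);
      assert(repo.put("flowfile-1"));      // write transparently fails over
      assert(memory.entries.size() == 1);  // data landed in the fallback
      std::cout << "failover test passed\n";
    }

The container's job would simply be to produce the real-world analogue of ReadOnlyDiskRepo -- for example, an actual volume remounted read-only underneath a running agent -- which is exactly the 'break stuff' freedom we don't get in our current unit and integration tests.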
On Tue, Aug 1, 2017 at 10:56 AM, Kevin Doran <[email protected]> wrote:

> Hi Marc,
>
> Thanks for the write-up in email and on the linked JIRAs. I took a look just now and have some initial thoughts (a lot of this probably goes without saying):
>
> 1. I agree that partial failures (e.g., slower reads/writes, decreased network bandwidth, etc.) are hard to classify and should stay out of scope for now until we tackle complete failures (e.g., no disk, no network).
>
> 2. Logging and readme documentation will be important to assist troubleshooting/debugging. If an agent is configured to use a persistent repository, and it has degraded to a volatile repository, that could be really confusing to a novice user/admin who is trying to figure out how the agent is working. Therefore we need to make sure changes to agent behavior that occur as part of continuing operations are logged at some level.
>
> 3. Have you given any thought to testability? Forcing environments that would trigger failover capabilities will be difficult, both for developers implementing those capabilities and for admins/operations folks who want to test their configurations before deploying them.
>
> 4. I think in a lot of cases, graceful degradation / continued operation of the MiNiFi agent will be desirable. However, if we go with that, the corresponding controls over the "bounds of the client", as you put it, are key (e.g., a configuration option for repositories that specifies a failover repository and the parameters for when to fail over).
>
> 5. In terms of utilization caps, I think we should definitely have them, and make them configurable where possible. I guess this is another way to express the bounds of the client, e.g., "do whatever you need to keep running, but never use more than XX MB of memory". Disk/memory footprints of persistent/volatile repositories are probably easy ones to start with. There should be default/built-in prioritizers for deciding which flow files to drop when the cap is reached, and over time we can make that extensible. I think this is in line with Joe's comment on the JIRA [1] that data from different sensors will likely have different importance and we need a way to deal with that. At the end of the day, if a flow is failing, but inputs are still coming in, and the agent has a utilization cap... something has to be dropped.
>
> 6. There might be some concepts from the mobile platform space that we could carry over to the design of the agent. For example, on iOS, the OS is able to send lots of signals to apps regarding what is happening at the platform level, and the app can be implemented to act appropriately in different scenarios. For example, a memory warning, for which apps are supposed to dispose of any volatile resources that are nonessential or can be recreated, or a signal that the app is about to enter a background state. Maybe there are some good designs that can be carried over so custom processors have push/pull hooks into the state of the platform that is provided by the framework. E.g., maybe a processor wants to have conditional logic based on the state of memory or network I/O and the MiNiFi framework has APIs that make that discoverable (pull), and perhaps all custom processors can implement an interface that allows them to receive notifications from the framework when it detects some of these partial/complete failure conditions or is approaching configured utilization caps (push).
>
> I've watched both JIRAs and will follow this thread as well. I'll chime in with more after I have time to think about this more and as more people respond. I agree input from people with experience from the field would be really useful here.
>
> Kevin
>
> [1] https://issues.apache.org/jira/browse/MINIFI-356?focusedCommentId=16108832#comment-16108832
>
> On 8/1/17, 09:59, "Marc" <[email protected]> wrote:
>
> Good Morning,
>
> I've begun capturing some details in a ticket for durability and reliability of MiNiFi C++ clients [1]. The scope of this ticket is continuing operations despite failure within specific components. There is a linked ticket [2] that attempts to address some of the concerns brought up in MINIFI-356, focusing on memory usage.
>
> The spirit of the ticket was to capture conditions of known failure; however, given that more discussion has blossomed, I'd like to assess the experience of the mailing list. Continuing operations in any environment is difficult, particularly one in which we likely have little to no control. Simply gathering information to know when a failure is occurring is a major part of the battle. Per the tickets, there needs to be some discussion of how we classify failure.
>
> The ticket addressed the low hanging fruit, but there are certainly more conditions of failure. If a disk switches to read-only mode, becomes full, and/or runs out of inode entries, etc., we know a complete failure has occurred and thus can switch our type of write activity to use a volatile repo. I recognize that partial failures may occur, but how do we classify these? Should we classify these at all, or would this be venturing into a rabbit hole?
>
> For memory we can likely throttle queue sizes as needed. For networking and other components we could likely find other measures of failure. The goal, no matter the component, is to continue operations without human intervention -- with the hope that the configuration makes the bounds of the client obvious.
>
> My gut reaction is to separate out partial failure, as the low hanging fruit of complete failure is much easier to address, but I would love to hear the reaction of this list. Further, any input on the types of failures to address would be appreciated. I look forward to any and all responses.
>
> Best Regards,
> Marc
>
> [1] https://issues.apache.org/jira/browse/MINIFI-356
> [2] https://issues.apache.org/jira/browse/MINIFI-360
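P.S. One concrete follow-up on classifying complete disk failure, from my original note quoted above: on POSIX systems the read-only, full, and out-of-inodes cases are all observable via statvfs(3). A rough sketch -- statvfs is real POSIX API, but the helper name and the zero-threshold checks are mine, and a real implementation would also need to handle mid-write errors like ENOSPC:

    #include <sys/statvfs.h>
    #include <cstdio>

    // Returns true if the volume backing 'path' can no longer accept
    // writes: remounted read-only, out of blocks, or out of inodes.
    bool diskWriteFailureLikely(const char* path) {
      struct statvfs vfs{};
      if (statvfs(path, &vfs) != 0) {
        std::perror("statvfs");  // the path itself is gone -- treat as failure
        return true;
      }
      if (vfs.f_flag & ST_RDONLY) return true;  // volume is read-only
      if (vfs.f_bavail == 0) return true;       // no free blocks left
      if (vfs.f_favail == 0) return true;       // no free inodes left
      return false;
    }

    int main() {
      // Check the volume backing the current directory as an example.
      std::printf("write failure likely: %d\n", diskWriteFailureLikely("."));
    }

An agent could poll this cheaply on the repository path and use it to trigger the switch to a volatile repo before writes start failing.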
