Hi Marc,

Thanks for the write-up in email and on the linked JIRAs. I took a look just 
now and have some initial thoughts (a lot of this probably goes without saying):

1. I agree that partial failures (e.g., slower reads/writes, decreased network 
bandwidth, etc.) are hard to classify and should stay out of scope until we 
have tackled complete failures (e.g., no disk, no network). 

2. Logging and README documentation will be important to assist with 
troubleshooting and debugging. If an agent is configured to use a persistent 
repository but has degraded to a volatile repository, that could be really 
confusing to a novice user/admin who is trying to figure out how the agent is 
working. Therefore we need to make sure that changes to agent behavior which 
occur as part of continuing operations are logged at some level.
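
To make that concrete: even a grep-able WARN entry at the moment of 
degradation would go a long way. A rough sketch (the function and message 
below are invented for illustration, not actual MiNiFi C++ code):

// Hypothetical sketch -- illustrative only, not the MiNiFi C++ logging API.
#include <iostream>
#include <string>

void onRepositoryDegraded(const std::string &configured,
                          const std::string &fallback) {
  // A WARN-level, grep-able message makes the behavior change discoverable
  // to an admin reading the logs after the fact.
  std::cerr << "[WARN] repository '" << configured
            << "' failed; degrading to volatile repository '" << fallback
            << "'. Queued data will not survive a restart." << std::endl;
}

int main() {
  onRepositoryDegraded("FlowFileRepository", "VolatileRepository");
}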

3. Have you given any thought to testability? Forcing the environments that 
would trigger failover capabilities will be difficult, both for developers 
implementing those capabilities and for admins/operations folks who want to 
test their configurations before deploying them.
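
One way to make this tractable: have repositories write through a narrow 
interface so tests (and operators) can inject failures without a genuinely 
broken disk. A rough sketch of the seam I mean (all names invented, not from 
the actual codebase):

// Hypothetical sketch -- a fault-injection seam for tests; none of these
// names come from the MiNiFi C++ codebase.
#include <cassert>
#include <memory>
#include <string>
#include <system_error>

// Narrow filesystem interface that repository writes go through.
struct FileSystem {
  virtual ~FileSystem() = default;
  virtual std::error_code write(const std::string &path,
                                const std::string &data) = 0;
};

// Test double that simulates a full disk without needing a real one.
struct FullDiskFileSystem : FileSystem {
  std::error_code write(const std::string &, const std::string &) override {
    return std::make_error_code(std::errc::no_space_on_device);  // ENOSPC
  }
};

int main() {
  std::unique_ptr<FileSystem> fs = std::make_unique<FullDiskFileSystem>();
  std::error_code ec = fs->write("/var/minifi/flowfile.repo", "payload");
  // The failover logic under test should observe ENOSPC and degrade.
  assert(ec == std::errc::no_space_on_device);
  (void)ec;
}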

4. I think in a lot of cases, graceful degradation / continued operation of the 
MiNiFi agent will be desirable. However, if we go with that, the corresponding 
controls over the "bounds of the client," as you put it, are key (e.g., a 
configuration option for repositories that specifies a failover repository and 
the parameters for when to fail over).
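
As a strawman for those controls, here's a sketch of a failover decorator 
with an operator-configurable threshold. The Repository interface and the 
threshold knob are invented for illustration; the real MiNiFi C++ types will 
differ:

// Hypothetical sketch -- fail over from a persistent to a volatile
// repository after N consecutive write failures (N is configurable).
#include <map>
#include <memory>
#include <string>
#include <utility>

struct Repository {
  virtual ~Repository() = default;
  virtual bool put(const std::string &key, const std::string &value) = 0;
};

class FailoverRepository : public Repository {
 public:
  FailoverRepository(std::unique_ptr<Repository> primary,
                     std::unique_ptr<Repository> fallback,
                     int max_consecutive_failures)
      : primary_(std::move(primary)),
        fallback_(std::move(fallback)),
        max_failures_(max_consecutive_failures) {}

  bool put(const std::string &key, const std::string &value) override {
    if (!degraded_) {
      if (primary_->put(key, value)) {
        failures_ = 0;
        return true;
      }
      // The "bounds of the client": how many failures we tolerate before
      // degrading is an operator-facing knob, not a hard-coded constant.
      if (++failures_ >= max_failures_) degraded_ = true;
    }
    return fallback_->put(key, value);
  }

 private:
  std::unique_ptr<Repository> primary_, fallback_;
  int max_failures_;
  int failures_ = 0;
  bool degraded_ = false;
};

// Toy stand-ins so the sketch runs end to end.
struct VolatileRepository : Repository {
  std::map<std::string, std::string> data;
  bool put(const std::string &k, const std::string &v) override {
    data[k] = v;
    return true;
  }
};
struct BrokenRepository : Repository {
  bool put(const std::string &, const std::string &) override { return false; }
};

int main() {
  FailoverRepository repo(std::make_unique<BrokenRepository>(),
                          std::make_unique<VolatileRepository>(),
                          /*max_consecutive_failures=*/3);
  for (int i = 0; i < 5; ++i) repo.put("flowfile-" + std::to_string(i), "x");
}

The decorator shape keeps the failover policy out of the individual 
repository implementations, which seems like where we'd want the 
configuration to live.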

5. In terms of utilization caps, I think we should definitely have them, and 
make them configurable where possible. I guess this is another way to express 
the bounds of the client, e.g., "do whatever you need to keep running, but 
never use more than XXMB of memory". Disk/memory footprints of 
persistent/volatile repositories are probably easy ones to start with. There 
should be default/built-in prioritizers for deciding which flow files to drop 
when the cap is reached, and over time we can make that extensible. I think 
this is in line with Joe's comment on the JIRA [1] that data from different 
sensors will likely have different importance and we need a way to deal with 
that. At the end of the day, if a flow is failing but inputs are still coming 
in and the agent has a utilization cap... something has to be dropped. 
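
To sketch what a cap plus a pluggable drop prioritizer might look like 
(everything below is invented for illustration, not MiNiFi C++ API):

// Hypothetical sketch -- a memory-capped queue that consults a pluggable
// prioritizer to pick the flow file to drop once the cap is exceeded.
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct FlowFile {
  std::string payload;
  int priority;  // higher = more important, e.g. assigned per sensor
  std::size_t size() const { return payload.size(); }
};

// One possible built-in prioritizer: drop the lowest-priority item first.
struct LowestPriorityFirst {
  std::size_t pick(const std::vector<FlowFile> &q) const {
    return static_cast<std::size_t>(
        std::min_element(q.begin(), q.end(),
                         [](const FlowFile &a, const FlowFile &b) {
                           return a.priority < b.priority;
                         }) -
        q.begin());
  }
};

class CappedQueue {
 public:
  explicit CappedQueue(std::size_t max_bytes) : max_bytes_(max_bytes) {}

  void enqueue(FlowFile f) {
    bytes_ += f.size();
    queue_.push_back(std::move(f));
    // "Do whatever you need to keep running, but never use more than X":
    // evict until we are back under the configured cap.
    while (bytes_ > max_bytes_ && !queue_.empty()) {
      std::size_t victim = prioritizer_.pick(queue_);
      bytes_ -= queue_[victim].size();
      queue_.erase(queue_.begin() + static_cast<std::ptrdiff_t>(victim));
    }
  }

 private:
  std::size_t max_bytes_, bytes_ = 0;
  std::vector<FlowFile> queue_;
  LowestPriorityFirst prioritizer_;
};

int main() {
  CappedQueue q(/*max_bytes=*/1024);
  q.enqueue({std::string(800, 'a'), /*priority=*/1});
  q.enqueue({std::string(800, 'b'), /*priority=*/5});  // evicts the 'a' file
}

Making the prioritizer a pluggable strategy like this seems like a natural 
seam for the extensibility mentioned above.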

6. There might be some concepts from the mobile platform space that we could 
carry over to the design of the agent. On iOS, for example, the OS sends lots 
of signals to apps about what is happening at the platform level, and apps can 
be implemented to act appropriately in different scenarios: a memory warning, 
in response to which apps are supposed to dispose of any volatile resources 
that are nonessential or can be recreated, or a signal that the app is about 
to enter a background state. Maybe there are some good designs there that can 
be carried over so that custom processors have push/pull hooks into the 
platform state provided by the framework. E.g., maybe a processor wants 
conditional logic based on the state of memory or network I/O, and the MiNiFi 
framework has APIs that make that discoverable (pull); and perhaps all custom 
processors can implement an interface that allows them to receive 
notifications from the framework when it detects some of these partial / 
complete failure conditions or is approaching configured utilization caps 
(push).
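
Roughly, I'm picturing something like the sketch below; all of the types and 
callback names are made up to illustrate the push/pull split, not a proposal 
for concrete MiNiFi C++ APIs:

// Hypothetical sketch -- pull (query on demand) and push (subscribe for
// notifications) hooks into platform state for custom processors.
#include <cstdint>
#include <vector>

enum class PlatformEvent {
  MemoryWarning,  // analogous to an iOS memory warning
  NetworkUnavailable,
  DiskFailure,
  ApproachingUtilizationCap
};

// Pull side: processors query a snapshot when they need it.
struct PlatformState {
  std::uint64_t free_memory_bytes;
  bool network_available;
};

// Push side: processors opt in by implementing this interface.
struct PlatformEventListener {
  virtual ~PlatformEventListener() = default;
  virtual void onPlatformEvent(PlatformEvent event) = 0;
};

class Framework {
 public:
  PlatformState currentState() const { return state_; }           // pull
  void subscribe(PlatformEventListener *l) { listeners_.push_back(l); }
  void notify(PlatformEvent e) {                                  // push
    for (auto *l : listeners_) l->onPlatformEvent(e);
  }

 private:
  PlatformState state_{0, true};
  std::vector<PlatformEventListener *> listeners_;
};

struct MyProcessor : PlatformEventListener {
  void onPlatformEvent(PlatformEvent e) override {
    if (e == PlatformEvent::MemoryWarning) {
      // e.g. release caches, shrink internal queues
    }
  }
};

int main() {
  Framework fw;
  MyProcessor p;
  fw.subscribe(&p);
  fw.notify(PlatformEvent::MemoryWarning);
}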

I've watched both JIRAs and will follow this thread as well. I'll chime in 
again after I've had time to think this over and as more people respond. I 
agree that input from people with experience in the field would be really 
useful here.

Kevin

[1] 
https://issues.apache.org/jira/browse/MINIFI-356?focusedCommentId=16108832#comment-16108832

On 8/1/17, 09:59, "Marc" <[email protected]> wrote:

    Good Morning,
    
      I've begun capturing some details in a ticket for durability and
    reliability of MiNiFi C++ clients [1]. The scope of this ticket is
    continuing operations despite failure within specific components. There is
    a linked ticket [2] that attempts to address some of the concerns brought
    up in MINIFI-356, focusing on memory usage.
    
      The spirit of the ticket was meant to capture conditions of known
    failure; however, given that more discussion has blossomed, I'd like to
    assess the experience of the mailing list. Continuing operations in any
    environment is difficult, particularly one in which we likely have little
    to no control. Simply gathering information to know when a failure is
    occurring is a major part of the battle. According to the tickets, there
    needs to be some discussion of how we classify failure.
    
      The ticket addressed the low hanging fruit, but there are certainly more
    conditions of failure. If a disk switches to read-only mode, becomes full,
    and/or runs out of inode entries, etc., we know a complete failure has
    occurred and thus can switch our type of write activity to use a volatile
    repo. I recognize that partial failures may occur, but how do we classify
    these? Should we classify these at all, or would this be venturing into a
    rabbit hole?
    
       For memory we can likely throttle queue sizes as needed. For networking
    and other components we could likely find other measures of failure. The
    goal, no matter the component, is to continue operations without human
    intervention -- with the hope that the configuration makes the bounds of
    the client obvious.
    
       My gut reaction is to separate out partial failure, since the low
    hanging fruit of complete failure is much easier to address, but I would
    love to hear the reaction of this list. Further, any input on the types of
    failures to address would be appreciated. I look forward to any and all
    responses.
    
      Best Regards,
      Marc
    
    [1] https://issues.apache.org/jira/browse/MINIFI-356
    [2] https://issues.apache.org/jira/browse/MINIFI-360
    

