Thinking about failure/recovery mode ... Let's say we've had an incident - our search system flopped over and died, or something like that - and Graylog is saturated with messages; Graylog is busy flushing them out as fast as possible in order to catch up.
In this mode, for the duration of the "catching up" phase, we're blacked out on the realtime flow of log events. I'm wondering how we can reduce the time to recover here. My customers want to use this system for a variety of purposes, but above all the data must be fresh and hot.

One way I'm thinking about handling this (and I admit I don't particularly like the idea) is to spin up a bunch of new Graylog instances and cut the realtime traffic over to them - at least that way we bypass the saturated Graylogs and get fresher data into the indexes. The penalty, of course, is that it requires adding more nodes and more connections; it feels like a fair amount of change to incur just to recover. Maybe that's the best we can do.

I'm wondering if anyone out there has established, trusted runbooks (or fail-safe implementations) for responding to this scenario and would be willing to share advice. I'm also wondering whether this is a feature area the Graylog dev team has been thinking about - I'd love to get your thoughts.

Hope you're having a great Monday.

Dave
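P.S. For what it's worth, here's roughly how I'm picturing the cut-over rule, as a minimal sketch. None of these names are Graylog APIs - the threshold, the lag figure, and the function are all hypothetical, and the lag number would in practice come from whatever journal/backlog metric your monitoring already scrapes:

```python
# Hypothetical sketch of the cut-over idea described above: when the
# primary Graylog tier's backlog exceeds a threshold, route realtime
# traffic to a standby tier so fresh events skip the saturated nodes.
# JOURNAL_LAG_THRESHOLD and choose_target are illustrative names,
# not part of any Graylog API.

JOURNAL_LAG_THRESHOLD = 500_000  # unflushed messages; tune per cluster


def choose_target(primary_lag: int, standby_available: bool) -> str:
    """Return which tier should receive the realtime log traffic."""
    if primary_lag > JOURNAL_LAG_THRESHOLD and standby_available:
        # Primary is deep in its catch-up phase; bypass it so the
        # indexes keep getting fresh data.
        return "standby"
    # Normal operation, or no standby tier to fail over to.
    return "primary"
```

The ugly part, as noted, isn't the decision itself - it's standing up the standby tier and re-pointing all the input connections, which is the operational change I'd love to avoid.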
