Thinking about failure/recovery mode ...

Let's say we've had an incident: our search system flopped over and died or
something, and Graylog is saturated with buffered messages; Graylog is busy
flushing them out as fast as possible in order to catch up.

In this mode, for the duration of this "catching up" phase, we're blacked
out on the realtime flow of log events.

I'm wondering how we can reduce the time to recover from this. My customers
want to use this system for a variety of purposes, but above all the data
must be fresh and hot.

One way I'm thinking about doing this (and I admit I don't particularly
like the idea) is to spin up a bunch of new Graylog instances and cut the
realtime traffic over to them; at least that way we can bypass the
saturated Graylogs and get fresher data into the indices.

The penalty with this, of course, is that it requires adding more nodes and
more connections; it feels like a fair amount of change to incur just to
recover. Maybe that's the best we can do.

I'm wondering if anyone out there has established, trusted runbooks (or
fail-safe implementations) for responding to this scenario and would be
willing to share advice.

I'm also wondering if this is a feature area the Graylog dev team has been
thinking about, and would love to get your thoughts.

Hope you're having a great Monday.

Dave

-- 
You received this message because you are subscribed to the Google Groups 
"Graylog Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/graylog2/CAJqE1Kdx5kWyjZLim_Pcw5r7GLFdmwgffrZ%3D8gyiOwnFsxsixw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.