[graylog2] Random exceptions on large datasets / lost messages

Marcel Manz Fri, 04 Sep 2015 06:44:29 -0700

Hi all

We run two Graylog servers (1.1.6), each of which also runs ES 1.7.1, in a
redundant setup behind a load balancer.

When we search over a longer period (e.g. one month, which involves
approximately 300 million messages), we have several times gotten an
exception in the web interface. In the worst case this caused either the
Graylog server process or Elasticsearch to fail, and those services had to
be restarted.

Yesterday such an exception occurred during a search, after which Graylog
could no longer write to ES and started filling up its internal journal.
After we restarted ES and ES recovered the indexes, the Graylog journal was
flushed to ES. Unfortunately, when we now search and look at the histogram,
we don't see any messages for the short period of the outage.

We already tried recalculating the index ranges (completed successfully),
but the messages still don't show up. Since we could clearly see messages
being queued in GL's journal (more than 100,000 during the few-minute
window) and then flushed to ES, we believe the messages were actually
stored in ES but GL is somehow unable to see them.

How can we investigate this? It concerns us that messages could be lost
even though GL's journal was in use at the time of the error.
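
One check that might narrow this down is to ask ES directly how many
documents fall into the outage window, bypassing Graylog. The following is
only a minimal sketch under several assumptions: that ES is reachable at
http://localhost:9200, that the indices use the "graylog2_" prefix, and that
messages carry a "timestamp" field in UTC formatted as
"yyyy-MM-dd HH:mm:ss.SSS". The helper names and the example times are
hypothetical placeholders, not values from our setup.

```python
import json
import urllib.request


def build_count_request(es_url, index_pattern, start, end):
    """Build the URL and JSON body for an ES _count query over a time range.

    Assumes messages have a "timestamp" field; adjust if yours differs.
    """
    url = "{}/{}/_count".format(es_url.rstrip("/"), index_pattern)
    body = {"query": {"range": {"timestamp": {"gte": start, "lte": end}}}}
    return url, body


def count_messages(es_url, index_pattern, start, end):
    """POST the count query to ES and return the number of matching docs."""
    url, body = build_count_request(es_url, index_pattern, start, end)
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),  # a request body makes this a POST
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))["count"]


# Example call (placeholder host, index pattern and times):
# count_messages("http://localhost:9200", "graylog2_*",
#                "2015-09-03 10:00:00.000", "2015-09-03 10:15:00.000")
```

If the count is non-zero, the documents are in ES and the problem is more
likely on the Graylog side (for example the index ranges); if it is zero,
the messages may have been stored with a different timestamp or not stored
at all.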

Thanks

Best regards,
Marcel
