Hi there

I've got a graylog-1.1.3 instance (web/server + elasticsearch) running on
CentOS 7 that I haven't changed INPUTs on for some months (i.e. I have one
incoming syslog feed and 'n' GELF feeds). From what I know, graylog-server
takes that data and pushes it into elasticsearch according to the
shard/retention settings, with old data auto-expired according to those same
settings.

As such, I would expect it to reach a "steady state" where the fundamental
OS characteristics are fairly stable - i.e. it would use just about "this
much" RAM, "this many open files", etc.

Anyhow, two days ago it totally went down after running out of open file
descriptors. It ended up corrupting over 9000 indexes before I noticed - a
real mess. I increased nofile, rebooted, and then used that very nice
script referred to in the issue below to re-absorb the borked indexes:

https://github.com/elastic/elasticsearch/issues/4206
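
For anyone else in the same boat: before running that script it's worth
listing which indexes are actually broken (sketch, again assuming
elasticsearch on localhost:9200):

    # list indexes with their health; the "red" ones are the corrupted/unassigned ones
    curl -s 'http://localhost:9200/_cat/indices?v' | grep -w red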

So the thing I don't understand is why this happened (or why it didn't
happen sooner). In a steady-state environment, why would the number of open
files increase over time? e.g. only one index is open for write at any
moment, and indexes are only open for read during searches, so why would
this grow? More importantly, if this increase is meant to happen, doesn't
that imply that running out of file descriptors is inevitable?
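
To see whether the count really does creep up under a steady workload, I'm
now logging it once an hour and will graph it - a rough sketch (log path is
just an example):

    # append a timestamped open-FD count once an hour
    while true; do
      echo "$(date '+%F %T') $(curl -s 'http://localhost:9200/_nodes/stats/process' \
        | grep -o '"open_file_descriptors":[0-9]*')" >> /var/log/es-open-fds.log
      sleep 3600
    done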

The other thing is why graylog-server didn't exit when this situation
occurred. It seems to me that elasticsearch itself should have exited once
it started erroring (you don't recover from running out of file
descriptors), but since it didn't, why didn't graylog-server? Under what
circumstances is it better to end up with 9000 corrupt indexes rather than
a total outage? I'm still waiting for elasticsearch to finish re-assigning
the unassigned_shards created by the above recovery process - it's working,
but it's been 8 hours so far and it's still plodding along (so it's a
two-day outage for me so far). If graylog-server figured out elasticsearch
was status "RED", why not shut down entirely so as not to make the
situation any worse, and cause an easier-to-notice outage?
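
(In case it helps anyone following along, this is how I'm watching the
recovery - nothing clever, just the standard endpoints:)

    # overall cluster status (red/yellow/green) plus the unassigned_shards count
    curl -s 'http://localhost:9200/_cluster/health?pretty'

    # which individual shards are still unassigned
    curl -s 'http://localhost:9200/_cat/shards?v' | grep UNASSIGNED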

Also, there's a bug with the elasticsearch RPMs.
/etc/sysconfig/elasticsearch states not to set MAX_OPEN_FILES when using
systemd (which you would be on CentOS 7) and to instead set
LimitNOFILE in /usr/lib/systemd/system/elasticsearch.service.
However, /usr/lib/systemd/system/elasticsearch.service is replaced every
time you upgrade elasticsearch. So either their documentation is wrong and
/etc/sysconfig/elasticsearch is what "wins", or their RPM installer is
broken. I'll open a bug report for them (not a graylog issue - but an FYI
for others).
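
The workaround I'm planning to use in the meantime is a systemd drop-in
override rather than editing the unit file directly, since drop-ins under
/etc/systemd survive package upgrades. A sketch (the nofile value is just an
example - pick whatever suits your setup):

    # /etc/systemd/system/elasticsearch.service.d/limits.conf
    [Service]
    LimitNOFILE=65535

    # then pick up the change:
    systemctl daemon-reload
    systemctl restart elasticsearch.service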


-- 
Cheers

Jason Haar
Information Security Manager, Trimble Navigation Ltd.
Phone: +1 408 481 8171
PGP Fingerprint: 7A2E 0407 C9A6 CAF6 2B9F 8422 C063 5EBB FE1D 66D1
