Henrik, okay. I suppose it worked for you in 0.92 as well?
I will create an issue on GitHub for that.

Bernd

On 25 February 2015 at 17:14, Henrik Johansen <[email protected]> wrote:
> Bernd,
>
> We saw the exact same issue. Here is a graph of the CPU idle
> percentage across a few of the cluster nodes during the upgrade:
>
> http://5.9.37.177/graylog_cluster_cpu_idle.png
>
> We went from ~20% CPU utilisation to ~100% CPU utilisation across
> ~200 cores, and things only settled down after disabling force_rdns.
>
> On 25 Feb 2015, at 11:55, Bernd Ahlers <[email protected]> wrote:
>
> Johan,
>
> the only thing that changed from 0.92 to 1.0 is that the DNS lookup is
> now done when the messages are read from the journal, and not in the
> input path where the messages are received. Otherwise, nothing has
> changed in that regard.
>
> We do not do any manual caching of the DNS lookups, but the JVM caches
> them by default. Check
> http://docs.oracle.com/javase/7/docs/technotes/guides/net/properties.html
> for networkaddress.cache.ttl and networkaddress.cache.negative.ttl.
>
> Regards,
> Bernd
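For reference, the two properties Bernd points to are JVM security properties, not -D system properties: they can be set in $JAVA_HOME/jre/lib/security/java.security, or programmatically before the JVM performs its first InetAddress lookup. A minimal sketch (the class name is just for illustration):

    import java.security.Security;

    public class DnsCacheSettings {
        public static void main(String[] args) {
            // Security properties, not system properties; they must be set
            // before the first InetAddress lookup. Values are in seconds,
            // and "-1" means cache forever.
            Security.setProperty("networkaddress.cache.ttl", "300");         // successful lookups
            Security.setProperty("networkaddress.cache.negative.ttl", "30"); // failed lookups
            // ... start the rest of the application here
        }
    }

The TTL values above are arbitrary examples; the right numbers depend on how often the source hosts' PTR records change.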
> On 25 February 2015 at 08:56, <[email protected]> wrote:
>
> This is strange. I went through all of the settings for my reply, and
> we are indeed using rDNS, and it seems to be the culprit. The strange
> part is that it works fine on the old servers even though they are on
> the same networks, using the same DNS servers and resolver settings.
>
> Did something regarding reverse DNS change between 0.92 and 1.0? I am
> thinking perhaps the server is trying to do one lookup per message
> instead of caching reverse lookups; the latter would result in very
> little DNS traffic, since most of the logs come from a small number of
> hosts.
>
> Regards,
> Johan
>
> On Tuesday, February 24, 2015 at 5:08:54 PM UTC+1, Bernd Ahlers wrote:
>
> Johan,
>
> this sounds very strange indeed. Can you provide us with some more
> details?
>
> - What kind of messages are you pouring into Graylog via UDP? (GELF,
>   raw, syslog?)
> - Do you have any extractors or grok filters running for the messages
>   coming in via UDP?
> - Any other differences between the TCP and UDP messages?
> - Can you show us your input configuration?
> - Are you using reverse DNS lookups?
>
> Thank you!
>
> Regards,
> Bernd
>
> On 24 February 2015 at 16:45, <[email protected]> wrote:
>
> Well, that could be a suspect if it were not for the fact that the old
> nodes running on old hardware handle it just fine, and that the
> traffic does reach the nodes (i.e. it fills the journal up just fine,
> and the input buffer never breaks a sweat). And it is really not that
> much traffic: even spread across four nodes, those ~1000 messages per
> second will cause this, whereas the old setup is just two nodes and
> handles it fine.
>
> About disk tuning, I have not done much of that, and I realize I
> forgot to mention that the Elasticsearch cluster is on separate
> physical hardware, so there is a minuscule amount of disk I/O
> happening on the Graylog nodes.
>
> It is really very strange, since it seems like UDP itself is not to
> blame; after all, the messages get into Graylog just fine and fill up
> the journal rapidly. The screenshot I linked was from after I had
> stopped sending logs, i.e. there was no longer any ingress traffic, so
> the Graylog process had nothing to do except empty its journal. It
> should all be internal processing and egress traffic to Elasticsearch,
> and as can be seen in the screenshot, it seems to be doing that in
> small bursts.
>
> In the exact same scenario (i.e. when I just streamed a large file
> into the system as fast as it could receive it) but with the logs
> having come over TCP, it will still store up a sizable number of
> messages in the journal, but the processing of the journaled messages
> is both more even and vastly faster.
>
> So in short, it does not appear to be the communication itself, but
> something happening "inside" the Graylog process, and that only
> happens when the messages have been delivered over UDP.
>
> Regards,
> Johan
>
> On Tuesday, February 24, 2015 at 3:07:47 PM UTC+1, Henrik Johansen wrote:
>
> Could this simply be because TCP avoids (or tries to avoid) congestion
> while UDP does not?
>
> /HJ
>
> On 24 Feb 2015, at 13:50, [email protected] wrote:
>
> Hello,
>
> With the release of 1.0 we have started moving towards a new cluster
> of GL hosts. These are working very well, with one exception.
>
> For some reason, any reasonably significant UDP traffic will choke the
> message processor, fill up the process buffers on all four hosts, and
> effectively choke all other message processing as well.
>
> Normally we do around 2k messages per second, split roughly 50/50
> between TCP and UDP. Sending the entire TCP load to one host does not
> present a problem; it does not break a sweat.
>
> I have also experimented a little with sending a large text file using
> rsyslog's imfile module. Sending it via TCP will bottleneck us at the
> ES side of things and cause the disk journal to fill up fairly
> rapidly, but it still works at ~9k messages per second, so that is
> fine. Sending it via UDP just causes GL to choke again: it fills the
> journal to a certain point and then slowly processes the journal in
> little bursts of a few thousand messages followed by several seconds
> of apparent sleeping (i.e. pretty much no CPU usage).
>
> During all of this the input buffer never fills up more than single
> digit percentages at most. Using TCP, the output buffer sometimes
> moves up to 20-30%; with UDP it never moves at all. It is all in the
> process buffer. Sending a large burst of messages and then stopping
> does not seem to affect this behavior either: even after the inbound
> messages stop, it still takes a long time to process the messages that
> are already in the journal and process buffer.
>
> I am using VisualVM to look at the CPU and memory usage; this is a
> screenshot of a UDP session:
> http://i59.tinypic.com/x23xfl.png
>
> I have tried mucking around with various knobs
> (processbuffer_processors, JVM settings, etc.) with no results
> whatsoever, good or bad. There is nothing to suggest a problem in
> either the Graylog or the system logs.
>
> Pertinent specs and settings:
>
> ring_size = 16384 (CPUs have 20 MB L3)
> processbuffer_processors = 5
>
> Java 8u31, 4 GB Xmx/Xms.
> Using G1GC with StringDeduplication; I have tried without the latter
> and just using CMS as well, no difference.
> Linux 3.16.0
> net.core.rmem_max = 8388608
>
> These are virtual machines: VMware, 8 GB RAM / 8 vCPUs, Xeon E5-2690s.
>
> Software-wise, the old nodes are running more or less the same setup,
> except kernel 3.2.0; same JVM, G1GC, etc. Hardware-wise, they are
> physical boxes, old Dell 2950s with dual quad-core E5440s. That is
> Core2-era, so quite a bit slower.
>
> Any ideas?
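One way to test whether uncached reverse lookups can account for stalls like this (the diagnosis the replies above converge on) is to time a single lookup from one of the Graylog nodes. A rough probe, an illustration rather than Graylog code; 203.0.113.10 is a documentation address, substitute a real log source:

    import java.net.InetAddress;

    public class RdnsProbe {
        public static void main(String[] args) throws Exception {
            // getByName() with an IP literal does not hit DNS; the reverse
            // lookup happens in getCanonicalHostName().
            InetAddress addr = InetAddress.getByName("203.0.113.10");
            long t0 = System.nanoTime();
            String host = addr.getCanonicalHostName();
            long ms = (System.nanoTime() - t0) / 1_000_000;
            System.out.printf("rDNS %s -> %s in %d ms%n",
                    addr.getHostAddress(), host, ms);
        }
    }

At ~1000 messages per second, even a few milliseconds per uncached lookup would be enough to pin the process buffer threads, which would fit the bursty, mostly idle pattern described above.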
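Johan's hypothesis earlier in the thread, one lookup per message rather than one per source host, also suggests a mitigation. A minimal memoizing resolver along those lines (a hypothetical sketch, not Graylog's implementation; a production version would also want an entry TTL and a size bound, which the JVM-level cache Bernd mentions provides for free):

    import java.net.InetAddress;
    import java.net.UnknownHostException;
    import java.util.concurrent.ConcurrentHashMap;

    public class CachingRdnsResolver {
        private final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();

        public String resolve(String ip) {
            // One lookup per distinct source address; all later messages from
            // the same host are served from the in-memory map. computeIfAbsent
            // also ensures only one thread performs the lookup for a given key.
            return cache.computeIfAbsent(ip, key -> {
                try {
                    return InetAddress.getByName(key).getCanonicalHostName();
                } catch (UnknownHostException e) {
                    return key; // fall back to the raw address on failure
                }
            });
        }
    }

Since most of the logs come from a small number of hosts, a cache like this collapses thousands of lookups per second into a handful.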
