Bernd,

We saw the exact same issue. Here is a graph of the CPU idle
percentage across a few of the cluster nodes during the upgrade:

http://5.9.37.177/graylog_cluster_cpu_idle.png

We went from ~20% CPU utilisation to ~100% CPU utilisation across
~200 cores, and things only settled down after disabling force_rdns.
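For anyone who needs to keep reverse DNS enabled, another angle is the
JVM-level lookup cache Bernd points at below. A minimal sketch of what
we would try, assuming an Oracle/OpenJDK 7 or 8 install and that
caching addresses for a few minutes is acceptable in your environment.
Note these are security properties read from the java.security file,
not ordinary -D system properties:

    # $JAVA_HOME/jre/lib/security/java.security
    # Cache successful lookups for 5 minutes; -1 would cache forever.
    networkaddress.cache.ttl=300
    # Cache failed lookups briefly too, so an unresolvable source
    # address does not cost one fresh lookup per message.
    networkaddress.cache.negative.ttl=30

The same properties can also be set programmatically with
java.security.Security.setProperty() before the first lookup happens.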
> On 25 Feb 2015, at 11:55, Bernd Ahlers <[email protected]> wrote:
>
> Johan,
>
> the only thing that changed from 0.92 to 1.0 is that the DNS lookup is
> now done when the messages are read from the journal and not in the
> input path where the messages are received. Otherwise, nothing has
> changed in that regard.
>
> We do not do any manual caching of the DNS lookups, but the JVM caches
> them by default. Check
> http://docs.oracle.com/javase/7/docs/technotes/guides/net/properties.html
> for networkaddress.cache.ttl and networkaddress.cache.negative.ttl.
>
> Regards,
> Bernd
>
> On 25 February 2015 at 08:56, <[email protected]> wrote:
>> This is strange. I went through all of the settings for my reply, and
>> we are indeed using rDNS, and it does seem to be the culprit. The
>> strange part is that it works fine on the old servers even though
>> they are on the same networks and use the same DNS servers and
>> resolver settings.
>> Did something regarding reverse DNS change between 0.92 and 1.0? I'm
>> thinking perhaps the server is trying to do one lookup per message
>> instead of caching reverse lookups; caching would result in very
>> little DNS traffic, since most of the logs come from a small number
>> of hosts.
>>
>> Regards
>> Johan
>>
>> On Tuesday, February 24, 2015 at 5:08:54 PM UTC+1, Bernd Ahlers wrote:
>>>
>>> Johan,
>>>
>>> this sounds very strange indeed. Can you provide us with some more
>>> details?
>>>
>>> - What kind of messages are you pouring into Graylog via UDP? (GELF,
>>>   raw, syslog?)
>>> - Do you have any extractors or grok filters running for the
>>>   messages coming in via UDP?
>>> - Any other differences between the TCP and UDP messages?
>>> - Can you show us your input configuration?
>>> - Are you using reverse DNS lookups?
>>>
>>> Thank you!
>>>
>>> Regards,
>>> Bernd
>>>
>>> On 24 February 2015 at 16:45, <[email protected]> wrote:
>>>> Well, that could be a suspect if it weren't for the fact that the
>>>> old nodes running on old hardware handle it just fine, along with
>>>> the fact that the traffic does reach the nodes (i.e. it fills the
>>>> journal up just fine, and the input buffer never breaks a sweat).
>>>> And it's really not that much traffic: even spread across four
>>>> nodes, those ~1000 messages per second cause this, whereas the old
>>>> nodes are just two and handle it just fine.
>>>>
>>>> About disk tuning, I haven't done much of that, and I realize I
>>>> forgot to mention that the Elasticsearch cluster is on separate
>>>> physical hardware, so there is a minuscule amount of disk I/O
>>>> happening on the Graylog nodes.
>>>>
>>>> It's really very strange, since it seems like UDP itself isn't to
>>>> blame; after all, the messages get into Graylog just fine and fill
>>>> up the journal rapidly. The screenshot I linked was from after I
>>>> had stopped sending logs, i.e. there was no longer any ingress
>>>> traffic, so the Graylog process had nothing to do except empty its
>>>> journal. It should all be internal processing and egress traffic to
>>>> Elasticsearch, and as can be seen in the screenshot, it seems to be
>>>> doing that in small bursts.
>>>>
>>>> In the exact same scenario (i.e. when I just streamed a large file
>>>> into the system as fast as it could receive it) but with the logs
>>>> having come in over TCP, it will still store up a sizable number of
>>>> messages in the journal, but the processing of the journaled
>>>> messages is both more even and vastly faster.
>>>>
>>>> So in short, it doesn't appear to be the communication itself, but
>>>> something happening "inside" the Graylog process, and that only
>>>> happens when the messages have been delivered over UDP.
>>>>
>>>> Regards
>>>> Johan
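A side note from our own debugging of this: it is worth ruling out
kernel-level UDP drops before digging into the JVM. Roughly, on a
Linux box (counter names vary a little between kernel versions):

    # Watch for "packet receive errors" / "receive buffer errors"
    # growing while the test load is being sent.
    netstat -su

    # Kernel receive buffer ceilings; a UDP input can only request as
    # much socket buffer as rmem_max allows.
    sysctl net.core.rmem_default net.core.rmem_max

If those counters stay flat while the process buffer fills up, the
bottleneck is inside the process itself, which matches what Johan is
seeing.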
>>>> On Tuesday, February 24, 2015 at 3:07:47 PM UTC+1, Henrik Johansen
>>>> wrote:
>>>>>
>>>>> Could this simply be because TCP avoids (or tries to avoid)
>>>>> congestion while UDP does not?
>>>>>
>>>>> /HJ
>>>>>
>>>>> On 24 Feb 2015, at 13:50, [email protected] wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> With the release of 1.0 we've started moving towards a new cluster
>>>>> of GL hosts. These are working very well, with one exception.
>>>>> For some reason, any reasonably significant UDP traffic will choke
>>>>> the message processor, fill up the process buffers on all four
>>>>> hosts, and effectively choke all other message processing as well.
>>>>> Normally we do around 2k messages per second, split roughly 50/50
>>>>> between TCP and UDP. Sending the entire TCP load to one host
>>>>> doesn't present a problem; it doesn't break a sweat.
>>>>>
>>>>> I've also experimented a little with sending a large text file
>>>>> using rsyslog's imfile module. Sending it via TCP will bottleneck
>>>>> us at the ES side of things and cause the disk journal to fill up
>>>>> fairly rapidly, but it still works at ~9k messages per second, so
>>>>> that's fine. Sending it via UDP just causes GL to choke again: it
>>>>> fills up the journal to a certain point and then slowly processes
>>>>> it in little bursts of a few thousand messages followed by several
>>>>> seconds of apparent sleeping (i.e. pretty much no CPU usage).
>>>>>
>>>>> During all of this the input buffer never fills up more than
>>>>> single-digit percentages at most. Using TCP, the output buffer
>>>>> sometimes moves up to 20-30%; with UDP it never moves at all. It's
>>>>> all in the process buffer. Sending a large burst of messages and
>>>>> then stopping doesn't seem to affect this behavior either; even
>>>>> after the inbound messages stop, it still takes a long time to
>>>>> process the messages that are already in the journal and process
>>>>> buffer.
>>>>> I'm using VisualVM to look at the CPU and memory usage; this is a
>>>>> screenshot of a UDP session:
>>>>> http://i59.tinypic.com/x23xfl.png
>>>>>
>>>>> I've tried mucking around with various knobs
>>>>> (processbuffer_processors, JVM settings, etc.) with no results
>>>>> whatsoever, good or bad. There's nothing to suggest a problem in
>>>>> either the Graylog or system logs.
>>>>>
>>>>> Pertinent specs and settings:
>>>>> ring_size = 16384 (CPUs have 20 MB L3)
>>>>> processbuffer_processors = 5
>>>>>
>>>>> Java 8u31
>>>>> Using G1GC with StringDeduplication; I've tried without the
>>>>> latter, and just using CMS as well, with no difference.
>>>>> 4 GB Xmx/Xms.
>>>>> Linux 3.16.0
>>>>> net.core.rmem_max = 8388608
>>>>>
>>>>> These are virtual machines, VMware, 8 GB / 8 vCPUs, Xeon E5-2690s.
>>>>>
>>>>> Software wise, the old nodes are running more or less the same
>>>>> setup, except kernel 3.2.0; same JVM, G1GC, etc. Hardware wise,
>>>>> they're physical boxes, old Dell 2950s with dual quad-core E5440s.
>>>>> That's Core2 era, so quite a bit slower.
>>>>>
>>>>> Any ideas?
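For anyone wanting to reproduce Johan's file-replay test, here is a
minimal rsyslog sketch of the idea. The file path, target host, and
port are placeholders, and it assumes rsyslog 7+ with RainerScript
module loading plus classic selector lines for forwarding:

    module(load="imfile")

    # Tail the test file and emit each line as a syslog message.
    input(type="imfile"
          File="/tmp/replay-test.log"
          Tag="replay:"
          Severity="info"
          Facility="local0")

    # A single @ forwards over UDP, a double @@ over TCP, which makes
    # it easy to flip between the two cases described above.
    local0.* @graylog.example.com:5514
    #local0.* @@graylog.example.com:5514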
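Also for reference, the pertinent settings from the post as they would
appear in practice. The values are Johan's own, not recommendations,
and the JVM flags are one plausible way to spell "G1GC with
StringDeduplication" on 8u31:

    # graylog.conf on each node
    ring_size = 16384
    processbuffer_processors = 5

    # JVM options (string deduplication requires G1 and Java 8u20+)
    -Xms4g -Xmx4g -XX:+UseG1GC -XX:+UseStringDeduplication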
