Thanks for the feedback! :)

Bernd

On 2 March 2015 at 11:26, <[email protected]> wrote:
> I installed unbound locally and used this, and it seems to have
> resolved the issue. It's odd that the old server didn't show this
> behavior, but I'm happy enough that it's resolved anyway. :)
>
> Regards
> Johan
>
> On Friday, February 27, 2015 at 2:02:08 PM UTC+1, Bernd Ahlers wrote:
>>
>> Johan, Henrik,
>>
>> I tried to track this problem down. The problem is that the JVM does
>> not cache reverse DNS lookups. The available JVM DNS cache settings,
>> like "networkaddress.cache.ttl", only affect forward DNS lookups.
>>
>> The code for doing the reverse lookups in Graylog has not changed in
>> a long time, so this problem is not new in 1.0.
>>
>> In my test setup, enabling "force_rdns" for a syslog input reduced
>> the throughput from around 7000 msg/s to 300 msg/s. This was without
>> a local DNS cache. Once I installed a DNS cache on the Graylog
>> server, the throughput went up to around 3000 msg/s.
>>
>> We will investigate whether there is a sane way to cache the reverse
>> lookups ourselves. In the meantime I suggest testing with a DNS
>> cache installed on the Graylog server nodes to see if that helps, or
>> disabling the "force_rdns" setting.
>>
>> Regards,
>> Bernd
>>
>> On 25 February 2015 at 18:00, Bernd Ahlers <[email protected]> wrote:
>> > Johan, Henrik,
>> >
>> > thanks for the details. I created an issue on GitHub and will
>> > investigate.
>> >
>> > https://github.com/Graylog2/graylog2-server/issues/999
>> >
>> > Regards,
>> > Bernd
>> >
>> > On 25 February 2015 at 17:48, Henrik Johansen <[email protected]> wrote:
>> >> Bernd,
>> >>
>> >> Correct - that issue started after 0.92.x.
>> >>
>> >> We are still seeing elevated CPU utilisation, but we are
>> >> attributing that to the fact that 0.92 was losing messages in our
>> >> setup.
>> >>
>> >>> On 25 Feb 2015, at 17:37, Bernd Ahlers <[email protected]> wrote:
>> >>>
>> >>> Henrik,
>> >>>
>> >>> uh, okay. I suppose it worked for you in 0.92 as well?
>> >>>
>> >>> I will create an issue on GitHub for that.
>> >>>
>> >>> Bernd
>> >>>
>> >>> On 25 February 2015 at 17:14, Henrik Johansen <[email protected]> wrote:
>> >>>> Bernd,
>> >>>>
>> >>>> We saw the exact same issue - here is a graph of the CPU idle
>> >>>> percentage across a few of the cluster nodes during the
>> >>>> upgrade:
>> >>>>
>> >>>> http://5.9.37.177/graylog_cluster_cpu_idle.png
>> >>>>
>> >>>> We went from ~20% CPU utilisation to ~100% CPU utilisation
>> >>>> across ~200 cores, and things only settled down after disabling
>> >>>> force_rdns.
>> >>>>
>> >>>> On 25 Feb 2015, at 11:55, Bernd Ahlers <[email protected]> wrote:
>> >>>>
>> >>>> Johan,
>> >>>>
>> >>>> the only thing that changed from 0.92 to 1.0 is that the DNS
>> >>>> lookup is now done when the messages are read from the journal,
>> >>>> and not in the input path where the messages are received.
>> >>>> Otherwise, nothing has changed in that regard.
>> >>>>
>> >>>> We do not do any manual caching of the DNS lookups, but the JVM
>> >>>> caches them by default. Check
>> >>>> http://docs.oracle.com/javase/7/docs/technotes/guides/net/properties.html
>> >>>> for networkaddress.cache.ttl and networkaddress.cache.negative.ttl.
>> >>>>
>> >>>> Regards,
>> >>>> Bernd
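
For illustration, here is a minimal sketch of the kind of
reverse-lookup cache discussed above. The class name, TTL, and
structure are hypothetical, not Graylog's actual implementation; the
point is that, as described above, the JVM's networkaddress.cache.ttl
setting covers forward lookups but not the PTR query behind
getCanonicalHostName(), so any caching of reverse lookups has to be
done by hand:

    import java.net.InetAddress;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    // Hypothetical reverse-DNS cache. Forward lookups are cached by
    // the JVM (see the Oracle properties page linked above); reverse
    // lookups are not, so we keep our own TTL-bounded map.
    public final class RdnsCache {
        private static final long TTL_MILLIS = 60_000L; // illustrative TTL

        private static final class Entry {
            final String hostname;
            final long expiresAt;
            Entry(String hostname, long expiresAt) {
                this.hostname = hostname;
                this.expiresAt = expiresAt;
            }
        }

        private final ConcurrentMap<String, Entry> cache = new ConcurrentHashMap<>();

        public String reverseLookup(InetAddress address) {
            final String ip = address.getHostAddress();
            final long now = System.currentTimeMillis();
            final Entry cached = cache.get(ip);
            if (cached != null && cached.expiresAt > now) {
                return cached.hostname; // served from our own cache
            }
            // getCanonicalHostName() triggers the actual PTR query.
            final String hostname = address.getCanonicalHostName();
            cache.put(ip, new Entry(hostname, now + TTL_MILLIS));
            return hostname;
        }
    }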

>> >>>> On 25 February 2015 at 08:56, <[email protected]> wrote:
>> >>>>
>> >>>> This is strange. I went through all of the settings for my
>> >>>> reply, and we are indeed using rdns, and it seems to be the
>> >>>> culprit. The strangeness is that it works fine on the old
>> >>>> servers even though they're on the same networks and using the
>> >>>> same DNS servers and resolver settings.
>> >>>> Did something regarding reverse DNS change between 0.92 and
>> >>>> 1.0? I'm thinking perhaps the server is trying to do one lookup
>> >>>> per message instead of caching reverse lookups, seeing as the
>> >>>> latter would result in very little DNS traffic, since most of
>> >>>> the logs will be coming from a small number of hosts.
>> >>>>
>> >>>> Regards
>> >>>> Johan
>> >>>>
>> >>>> On Tuesday, February 24, 2015 at 5:08:54 PM UTC+1, Bernd
>> >>>> Ahlers wrote:
>> >>>>
>> >>>> Johan,
>> >>>>
>> >>>> this sounds very strange indeed. Can you provide us with some
>> >>>> more details?
>> >>>>
>> >>>> - What kind of messages are you pouring into Graylog via UDP?
>> >>>>   (GELF, raw, syslog?)
>> >>>> - Do you have any extractors or grok filters running for the
>> >>>>   messages coming in via UDP?
>> >>>> - Any other differences between the TCP and UDP messages?
>> >>>> - Can you show us your input configuration?
>> >>>> - Are you using reverse DNS lookups?
>> >>>>
>> >>>> Thank you!
>> >>>>
>> >>>> Regards,
>> >>>> Bernd
>> >>>>
>> >>>> On 24 February 2015 at 16:45, <[email protected]> wrote:
>> >>>>
>> >>>> Well, that could be a suspect if it weren't for the fact that
>> >>>> the old nodes running on old hardware handle it just fine,
>> >>>> along with the fact that the traffic seems to reach the nodes
>> >>>> just fine (i.e. it actually fills the journal up, and the input
>> >>>> buffer never breaks a sweat). And it's really not that much
>> >>>> traffic; even spread across four nodes, those ~1000 messages
>> >>>> per second will cause this, whereas the old nodes are just two
>> >>>> and handle it just fine.
>> >>>>
>> >>>> About disk tuning, I haven't done much of that, and I realize
>> >>>> I forgot to mention that the Elasticsearch cluster is on
>> >>>> separate physical hardware, so there's a minuscule amount of
>> >>>> disk I/O happening on the Graylog nodes.
>> >>>>
>> >>>> It's really very strange, since it seems like UDP itself isn't
>> >>>> to blame; after all, the messages get into Graylog just fine
>> >>>> and fill up the journal rapidly. The screenshot I linked was
>> >>>> from after I had stopped sending logs, i.e. there was no longer
>> >>>> any ingress traffic, so the Graylog process had nothing to do
>> >>>> except empty its journal; it should all be internal processing
>> >>>> and egress traffic to Elasticsearch. And as can be seen in the
>> >>>> screenshot, it seems to be doing it in small bursts.
>> >>>>
>> >>>> In the exact same scenario (i.e. when I just streamed a large
>> >>>> file into the system as fast as it could receive it) but with
>> >>>> the logs having come over TCP, it'll still store up a sizable
>> >>>> number of messages in the journal, but the processing of the
>> >>>> journaled messages is both more even and vastly faster.
>> >>>>
>> >>>> So in short, it doesn't appear to be the communication itself,
>> >>>> but something happening "inside" the Graylog process, and that
>> >>>> only happens when the messages have been delivered over UDP.
>> >>>>
>> >>>> Regards
>> >>>> Johan
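
For anyone who wants to reproduce this kind of test, a throwaway
sender along these lines can stand in for streaming a large file over
UDP. The host, port, and message count below are placeholders, not
values from this thread:

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.net.InetAddress;
    import java.nio.charset.StandardCharsets;

    // Tiny load generator: blast syslog-formatted lines at a UDP
    // input as fast as the socket allows, mimicking the
    // file-streaming test described above.
    public class UdpSyslogFlood {
        public static void main(String[] args) throws Exception {
            InetAddress target = InetAddress.getByName("graylog.example.com"); // placeholder host
            int port = 5140;      // placeholder UDP syslog input port
            int count = 100_000;  // number of test messages
            try (DatagramSocket socket = new DatagramSocket()) {
                for (int i = 0; i < count; i++) {
                    String line = "<13>Feb 25 12:00:00 testhost test: message " + i;
                    byte[] payload = line.getBytes(StandardCharsets.UTF_8);
                    socket.send(new DatagramPacket(payload, payload.length, target, port));
                }
            }
        }
    }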

>> >>>> On Tuesday, February 24, 2015 at 3:07:47 PM UTC+1, Henrik
>> >>>> Johansen wrote:
>> >>>>
>> >>>> Could this simply be because TCP avoids (or tries to avoid)
>> >>>> congestion while UDP does not?
>> >>>>
>> >>>> /HJ
>> >>>>
>> >>>> On 24 Feb 2015, at 13:50, [email protected] wrote:
>> >>>>
>> >>>> Hello,
>> >>>>
>> >>>> With the release of 1.0 we've started moving towards a new
>> >>>> cluster of GL hosts. These are working very well, with one
>> >>>> exception.
>> >>>> For some reason, any reasonably significant UDP traffic will
>> >>>> choke the message processor, fill up the process buffers on all
>> >>>> four hosts, and effectively choke up all other message
>> >>>> processing as well.
>> >>>> Normally we do around 2k messages per second, split roughly
>> >>>> 50/50 between TCP and UDP. Sending the entire TCP load to one
>> >>>> host doesn't present a problem; it doesn't break a sweat.
>> >>>>
>> >>>> I've also experimented a little with sending a large text file
>> >>>> using rsyslog's imfile module. Sending it via TCP will
>> >>>> bottleneck us at the ES side of things and cause the disk
>> >>>> journal to fill up fairly rapidly, but it's still working at
>> >>>> ~9k messages per second, so that's fine. Sending it via UDP
>> >>>> just causes GL to choke again, fill the journal up to a certain
>> >>>> point, and then slowly process the journal in little bursts of
>> >>>> a few thousand messages followed by several seconds of apparent
>> >>>> sleeping (i.e. pretty much no CPU usage).
>> >>>>
>> >>>> During all of this, the input buffer never fills up more than
>> >>>> at most single-digit percentages. Using TCP, the output buffer
>> >>>> sometimes moves up to 20-30%; with UDP it never moves at all.
>> >>>> It's all in the process buffer.
>> >>>> Sending a large burst of messages and then stopping doesn't
>> >>>> seem to affect this behavior either; even after the inbound
>> >>>> messages stop, it still takes a long time to process the
>> >>>> messages that are already in the journal and process buffer.
>> >>>> I'm using VisualVM to look at the CPU and memory usage; this is
>> >>>> a screenshot of a UDP session:
>> >>>> http://i59.tinypic.com/x23xfl.png
>> >>>>
>> >>>> I've tried mucking around with various knobs
>> >>>> (processbuffer_processors, JVM settings, etc.) with no results
>> >>>> whatsoever, good or bad.
>> >>>> There's nothing to suggest a problem in either the Graylog or
>> >>>> the system logs.
>> >>>>
>> >>>> Pertinent specs and settings:
>> >>>> ring_size = 16384 (CPUs have 20 MB L3)
>> >>>> processbuffer_processors = 5
>> >>>>
>> >>>> Java 8u31
>> >>>> Using G1GC with StringDeduplication; I've tried without the
>> >>>> latter and with plain CMS as well, no difference.
>> >>>> 4 GB Xmx/Xms.
>> >>>> Linux 3.16.0
>> >>>> net.core.rmem_max = 8388608
>> >>>>
>> >>>> These are virtual machines, VMware, 8 GB / 8 vCPUs, Xeon
>> >>>> E5-2690s.
>> >>>>
>> >>>> Software-wise, the old nodes are running more or less the same
>> >>>> setup, except kernel 3.2.0; same JVM, G1GC, etc. Hardware-wise,
>> >>>> they're physical boxes, old Dell 2950s with dual quad-core
>> >>>> E5440s. That's Core2 era, so quite a bit slower.
>> >>>>
>> >>>> Any ideas?
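
Since net.core.rmem_max caps whatever UDP receive buffer an
application asks for, it is worth verifying what the kernel actually
grants. A quick, self-contained check; the 16 MB request is just an
example larger than the 8388608-byte cap mentioned above:

    import java.net.DatagramSocket;

    // Request a large UDP receive buffer and print what the kernel
    // actually granted; on Linux the request is silently capped by
    // net.core.rmem_max.
    public class UdpBufferCheck {
        public static void main(String[] args) throws Exception {
            try (DatagramSocket socket = new DatagramSocket(0)) {
                socket.setReceiveBufferSize(16 * 1024 * 1024); // ask for 16 MB
                System.out.println("granted receive buffer: "
                        + socket.getReceiveBufferSize() + " bytes");
            }
        }
    }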

--
Developer

Tel.: +49 (0)40 609 452 077
Fax.: +49 (0)40 609 452 078

TORCH GmbH - A Graylog company
Steckelhörn 11
20457 Hamburg
Germany

Commercial Reg. (Registergericht): Amtsgericht Hamburg, HRB 125175
Geschäftsführer: Lennart Koopmann (CEO)
