Thanks for the feedback! :)

Bernd

On 2 March 2015 at 11:26, <[email protected]> wrote:
> I installed unbound locally and used this, and it seems to have
> resolved the issue. It's odd that the old server didn't show this
> behavior, but I'm happy enough that it's resolved anyway. :)
>
> Regards
> Johan
>
> On Friday, February 27, 2015 at 2:02:08 PM UTC+1, Bernd Ahlers wrote:
>>
>> Johan, Henrik,
>>
>> I tried to track this problem down. The problem is that the JVM does
>> not cache reverse DNS lookups. The available JVM DNS cache settings,
>> like "networkaddress.cache.ttl", only affect forward DNS lookups.
>>
>> The code for doing the reverse lookups in Graylog has not changed in
>> a long time, so this problem is not new in 1.0.
>>
>> In my test setup, enabling "force_rdns" for a syslog input reduced
>> the throughput from around 7000 msg/s to 300 msg/s. This was without
>> a local DNS cache. Once I installed a DNS cache on the Graylog
>> server, the throughput went up to around 3000 msg/s.
>>
>> We will investigate whether there is a sane way to cache the reverse
>> lookups ourselves. In the meantime I suggest testing with a DNS
>> cache installed on the Graylog server nodes to see if that helps, or
>> disabling the "force_rdns" setting.
>>
>> Regards,
>> Bernd
>>
>> On 25 February 2015 at 18:00, Bernd Ahlers <[email protected]> wrote:
>> > Johan, Henrik,
>> >
>> > thanks for the details. I created an issue on GitHub and will
>> > investigate.
>> >
>> > https://github.com/Graylog2/graylog2-server/issues/999
>> >
>> > Regards,
>> > Bernd
>> >
>> > On 25 February 2015 at 17:48, Henrik Johansen <[email protected]> wrote:
>> >> Bernd,
>> >>
>> >> Correct - that issue started after 0.92.x.
>> >>
>> >> We are still seeing elevated CPU utilisation, but we are
>> >> attributing that to the fact that 0.92 was losing messages in our
>> >> setup.
>> >>
>> >>> On 25 Feb 2015, at 17:37, Bernd Ahlers <[email protected]> wrote:
>> >>>
>> >>> Henrik,
>> >>>
>> >>> uh, okay. I suppose it worked for you in 0.92 as well?
>> >>>
>> >>> I will create an issue on GitHub for that.
>> >>>
>> >>> Bernd
>> >>>
>> >>> On 25 February 2015 at 17:14, Henrik Johansen <[email protected]> wrote:
>> >>>> Bernd,
>> >>>>
>> >>>> We saw the exact same issue - here is a graph of the CPU idle
>> >>>> percentage across a few of the cluster nodes during the
>> >>>> upgrade:
>> >>>>
>> >>>> http://5.9.37.177/graylog_cluster_cpu_idle.png
>> >>>>
>> >>>> We went from ~20% CPU utilisation to ~100% CPU utilisation
>> >>>> across ~200 cores, and things only settled down after disabling
>> >>>> force_rdns.
>> >>>>
>> >>>> On 25 Feb 2015, at 11:55, Bernd Ahlers <[email protected]> wrote:
>> >>>>
>> >>>> Johan,
>> >>>>
>> >>>> the only thing that changed from 0.92 to 1.0 is that the DNS
>> >>>> lookup is now done when the messages are read from the journal,
>> >>>> and not in the input path where the messages are received.
>> >>>> Otherwise, nothing has changed in that regard.
>> >>>>
>> >>>> We do not do any manual caching of the DNS lookups, but the JVM
>> >>>> caches them by default. Check
>> >>>> http://docs.oracle.com/javase/7/docs/technotes/guides/net/properties.html
>> >>>> for networkaddress.cache.ttl and networkaddress.cache.negative.ttl.
>> >>>>
>> >>>> Regards,
>> >>>> Bernd
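
For illustration, here is a minimal sketch of the kind of
reverse-lookup cache discussed above. The class name, TTL, and
structure are hypothetical, not Graylog's actual implementation; the
point is that, as described above, the JVM's networkaddress.cache.ttl
setting covers forward lookups but not the PTR query behind
getCanonicalHostName(), so any caching of reverse lookups has to be
done by hand:

    import java.net.InetAddress;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    // Hypothetical reverse-DNS cache. Forward lookups are cached by
    // the JVM (see the Oracle properties page linked above); reverse
    // lookups are not, so we keep our own TTL-bounded map.
    public final class RdnsCache {
        private static final long TTL_MILLIS = 60_000L; // illustrative TTL

        private static final class Entry {
            final String hostname;
            final long expiresAt;
            Entry(String hostname, long expiresAt) {
                this.hostname = hostname;
                this.expiresAt = expiresAt;
            }
        }

        private final ConcurrentMap<String, Entry> cache = new ConcurrentHashMap<>();

        public String reverseLookup(InetAddress address) {
            final String ip = address.getHostAddress();
            final long now = System.currentTimeMillis();
            final Entry cached = cache.get(ip);
            if (cached != null && cached.expiresAt > now) {
                return cached.hostname; // served from our own cache
            }
            // getCanonicalHostName() triggers the actual PTR query.
            final String hostname = address.getCanonicalHostName();
            cache.put(ip, new Entry(hostname, now + TTL_MILLIS));
            return hostname;
        }
    }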

>> >>>> On 25 February 2015 at 08:56, <[email protected]> wrote:
>> >>>>
>> >>>> This is strange. I went through all of the settings for my
>> >>>> reply, and we are indeed using rdns, and it seems to be the
>> >>>> culprit. The strangeness is that it works fine on the old
>> >>>> servers even though they're on the same networks and using the
>> >>>> same DNS servers and resolver settings.
>> >>>> Did something regarding reverse DNS change between 0.92 and
>> >>>> 1.0? I'm thinking perhaps the server is trying to do one lookup
>> >>>> per message instead of caching reverse lookups, seeing as the
>> >>>> latter would result in very little DNS traffic, since most of
>> >>>> the logs will be coming from a small number of hosts.
>> >>>>
>> >>>> Regards
>> >>>> Johan
>> >>>>
>> >>>> On Tuesday, February 24, 2015 at 5:08:54 PM UTC+1, Bernd
>> >>>> Ahlers wrote:
>> >>>>
>> >>>> Johan,
>> >>>>
>> >>>> this sounds very strange indeed. Can you provide us with some
>> >>>> more details?
>> >>>>
>> >>>> - What kind of messages are you pouring into Graylog via UDP?
>> >>>>   (GELF, raw, syslog?)
>> >>>> - Do you have any extractors or grok filters running for the
>> >>>>   messages coming in via UDP?
>> >>>> - Any other differences between the TCP and UDP messages?
>> >>>> - Can you show us your input configuration?
>> >>>> - Are you using reverse DNS lookups?
>> >>>>
>> >>>> Thank you!
>> >>>>
>> >>>> Regards,
>> >>>> Bernd
>> >>>>
>> >>>> On 24 February 2015 at 16:45, <[email protected]> wrote:
>> >>>>
>> >>>> Well, that could be a suspect if it weren't for the fact that
>> >>>> the old nodes running on old hardware handle it just fine,
>> >>>> along with the fact that the traffic seems to reach the nodes
>> >>>> just fine (i.e. it actually fills the journal up, and the input
>> >>>> buffer never breaks a sweat). And it's really not that much
>> >>>> traffic; even spread across four nodes, those ~1000 messages
>> >>>> per second will cause this, whereas the old nodes are just two
>> >>>> and handle it just fine.
>> >>>>
>> >>>> About disk tuning, I haven't done much of that, and I realize
>> >>>> I forgot to mention that the Elasticsearch cluster is on
>> >>>> separate physical hardware, so there's a minuscule amount of
>> >>>> disk I/O happening on the Graylog nodes.
>> >>>>
>> >>>> It's really very strange, since it seems like UDP itself isn't
>> >>>> to blame; after all, the messages get into Graylog just fine
>> >>>> and fill up the journal rapidly. The screenshot I linked was
>> >>>> from after I had stopped sending logs, i.e. there was no longer
>> >>>> any ingress traffic, so the Graylog process had nothing to do
>> >>>> except empty its journal; it should all be internal processing
>> >>>> and egress traffic to Elasticsearch. And as can be seen in the
>> >>>> screenshot, it seems to be doing it in small bursts.
>> >>>>
>> >>>> In the exact same scenario (i.e. when I just streamed a large
>> >>>> file into the system as fast as it could receive it) but with
>> >>>> the logs having come over TCP, it'll still store up a sizable
>> >>>> number of messages in the journal, but the processing of the
>> >>>> journaled messages is both more even and vastly faster.
>> >>>>
>> >>>> So in short, it doesn't appear to be the communication itself,
>> >>>> but something happening "inside" the Graylog process, and that
>> >>>> only happens when the messages have been delivered over UDP.
>> >>>>
>> >>>> Regards
>> >>>> Johan
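
For anyone who wants to reproduce this kind of test, a throwaway
sender along these lines can stand in for streaming a large file over
UDP. The host, port, and message count below are placeholders, not
values from this thread:

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.net.InetAddress;
    import java.nio.charset.StandardCharsets;

    // Tiny load generator: blast syslog-formatted lines at a UDP
    // input as fast as the socket allows, mimicking the
    // file-streaming test described above.
    public class UdpSyslogFlood {
        public static void main(String[] args) throws Exception {
            InetAddress target = InetAddress.getByName("graylog.example.com"); // placeholder host
            int port = 5140;      // placeholder UDP syslog input port
            int count = 100_000;  // number of test messages
            try (DatagramSocket socket = new DatagramSocket()) {
                for (int i = 0; i < count; i++) {
                    String line = "<13>Feb 25 12:00:00 testhost test: message " + i;
                    byte[] payload = line.getBytes(StandardCharsets.UTF_8);
                    socket.send(new DatagramPacket(payload, payload.length, target, port));
                }
            }
        }
    }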

>> >>>> On Tuesday, February 24, 2015 at 3:07:47 PM UTC+1, Henrik
>> >>>> Johansen wrote:
>> >>>>
>> >>>> Could this simply be because TCP avoids (or tries to avoid)
>> >>>> congestion while UDP does not?
>> >>>>
>> >>>> /HJ
>> >>>>
>> >>>> On 24 Feb 2015, at 13:50, [email protected] wrote:
>> >>>>
>> >>>> Hello,
>> >>>>
>> >>>> With the release of 1.0 we've started moving towards a new
>> >>>> cluster of GL hosts. These are working very well, with one
>> >>>> exception.
>> >>>> For some reason, any reasonably significant UDP traffic will
>> >>>> choke the message processor, fill up the process buffers on all
>> >>>> four hosts, and effectively choke up all other message
>> >>>> processing as well.
>> >>>> Normally we do around 2k messages per second, split roughly
>> >>>> 50/50 between TCP and UDP. Sending the entire TCP load to one
>> >>>> host doesn't present a problem; it doesn't break a sweat.
>> >>>>
>> >>>> I've also experimented a little with sending a large text file
>> >>>> using rsyslog's imfile module. Sending it via TCP will
>> >>>> bottleneck us at the ES side of things and cause the disk
>> >>>> journal to fill up fairly rapidly, but it's still working at
>> >>>> ~9k messages per second, so that's fine. Sending it via UDP
>> >>>> just causes GL to choke again, fill the journal up to a certain
>> >>>> point, and then slowly process the journal in little bursts of
>> >>>> a few thousand messages followed by several seconds of apparent
>> >>>> sleeping (i.e. pretty much no CPU usage).
>> >>>>
>> >>>> During all of this, the input buffer never fills up more than
>> >>>> at most single-digit percentages. Using TCP, the output buffer
>> >>>> sometimes moves up to 20-30%; with UDP it never moves at all.
>> >>>> It's all in the process buffer.
>> >>>> Sending a large burst of messages and then stopping doesn't
>> >>>> seem to affect this behavior either; even after the inbound
>> >>>> messages stop, it still takes a long time to process the
>> >>>> messages that are already in the journal and process buffer.
>> >>>> I'm using VisualVM to look at the CPU and memory usage; this is
>> >>>> a screenshot of a UDP session:
>> >>>> http://i59.tinypic.com/x23xfl.png
>> >>>>
>> >>>> I've tried mucking around with various knobs
>> >>>> (processbuffer_processors, JVM settings, etc.) with no results
>> >>>> whatsoever, good or bad.
>> >>>> There's nothing to suggest a problem in either the Graylog or
>> >>>> the system logs.
>> >>>>
>> >>>> Pertinent specs and settings:
>> >>>> ring_size = 16384 (CPUs have 20 MB L3)
>> >>>> processbuffer_processors = 5
>> >>>>
>> >>>> Java 8u31
>> >>>> Using G1GC with StringDeduplication; I've tried without the
>> >>>> latter and with plain CMS as well, no difference.
>> >>>> 4 GB Xmx/Xms.
>> >>>> Linux 3.16.0
>> >>>> net.core.rmem_max = 8388608
>> >>>>
>> >>>> These are virtual machines, VMware, 8 GB / 8 vCPUs, Xeon
>> >>>> E5-2690s.
>> >>>>
>> >>>> Software-wise, the old nodes are running more or less the same
>> >>>> setup, except kernel 3.2.0; same JVM, G1GC, etc. Hardware-wise,
>> >>>> they're physical boxes, old Dell 2950s with dual quad-core
>> >>>> E5440s. That's Core2 era, so quite a bit slower.
>> >>>>
>> >>>> Any ideas?
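
Since net.core.rmem_max caps whatever UDP receive buffer an
application asks for, it is worth verifying what the kernel actually
grants. A quick, self-contained check; the 16 MB request is just an
example larger than the 8388608-byte cap mentioned above:

    import java.net.DatagramSocket;

    // Request a large UDP receive buffer and print what the kernel
    // actually granted; on Linux the request is silently capped by
    // net.core.rmem_max.
    public class UdpBufferCheck {
        public static void main(String[] args) throws Exception {
            try (DatagramSocket socket = new DatagramSocket(0)) {
                socket.setReceiveBufferSize(16 * 1024 * 1024); // ask for 16 MB
                System.out.println("granted receive buffer: "
                        + socket.getReceiveBufferSize() + " bytes");
            }
        }
    }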

--
Developer

Tel.: +49 (0)40 609 452 077
Fax.: +49 (0)40 609 452 078

TORCH GmbH - A Graylog company
Steckelhörn 11
20457 Hamburg
Germany

Commercial Reg. (Registergericht): Amtsgericht Hamburg, HRB 125175
Geschäftsführer: Lennart Koopmann (CEO)
