Just refreshing this issue, as when I try to use extractors/grok the node simply dies. analysing the GC logs I see there is a lot of full GC taking too much and pausing. Has anyone encountered same issues for environments with more than 40Kmsg/s?
Thanks On Wednesday, 24 August 2016 21:16:07 UTC+1, Ricardo Ferreira wrote: > > So to give more evidence to the case I let the two nodes to run for a > while, attaching server.log and GC logs. > > I used GC1 in one of the node with similar results, seems like graylog is > not able to keep up more than 50K msg/s. > > I'm monitoring graylog with the metrics plugin and ES, so I'm able to > provide historical trends if needed be. > ES does not seem to be the bottleneck nothing on logs nor in the > CPU/memory usage > > If you guys have any idea on how to proceed would be appreciated > > Thanks > Ricardo > > On Wednesday, 24 August 2016 14:12:45 UTC+1, Jochen Schalanda wrote: >> >> Hi Ricardo, >> >> try configuring *less* heap memory for your JVM, ideally less than 32G. >> See >> https://blog.codecentric.de/en/2014/02/35gb-heap-less-32gb-java-jvm-memory-oddities/ >> >> for details. >> >> Cheers, >> Jochen >> >> On Wednesday, 24 August 2016 15:02:10 UTC+2, Ricardo Ferreira wrote: >>> >>> So, we have a 2 identical node graylog setup: >>> >>> >>> ProLiant DL380 Gen9 (719061-B21) >>> 32: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz >>> 128GB RAM >>> 20MiB L3 cache >>> 2147479552 bytes (2.1 GB) copied, 12.8211 s, 167 MB/s >>> >>> 016-08-24T10:46:34.255Z INFO [CmdLineTool] Running with JVM arguments: >>> -Xms32g -Xmx32g -XX:NewRatio=1 -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC >>> -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:+UseParNewGC >>> -XX:-OmitStackTraceInFastThrow -Xloggc:/tmp/memoryGC.log >>> -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps >>> -Dlog4j.configurationFile=file:///etc/graylog/server/log4j2.xml >>> -Djava.library.path=/usr/share/graylog-server/lib/sigar >>> -Dgraylog2.installation_source=deb >>> >>> different from default configs: >>> >>> >>> elasticsearch_shards = 5 >>> elasticsearch_replicas = 1 >>> >>> processbuffer_processors = 15 >>> outputbuffer_processors = 12 >>> ring_size = 4194304 >>> inputbuffer_ring_size = 4194304 >>> inputbuffer_processors = 6 >>> inputbuffer_wait_strategy = blocking >>> message_journal_max_size = 30gb >>> async_eventbus_processors = 4 >>> >>> So our environment does ~ 30-50GB per hour in this testing phase we are >>> at ~5kmsg/s >>> If you look at the graphs we started doing a run just into one done and >>> was changed to another node around 11:50UTC >>> >>> While it was running i noticed that the graylog node was de registering >>> himself: >>> >>> 2016-08-24T10:46:57.254Z WARN [jvm] >>> [graylog-1743aa7d-c19e-48fa-89fa-f2e619ec0692] [gc][young][8][4] duration >>> [3.1s], collections [1]/[3.9s], total [3.1s]/[3.8s], memory >>> [9gb]->[2gb]/[30.4gb], all_pools {[young] >>> [7.8gb]->[445.1mb]/[12.8gb]}{[survivor] [1.2gb]->[886.4mb]/[1.5gb]}{[old] >>> [0b]->[770.9mb]/[16gb]} >>> 2016-08-24T10:46:57.374Z INFO [connection] Opened connection >>> [connectionId{localValue:22, serverValue:790}] to 172.24.9.59:27017 >>> 2016-08-24T10:46:57.402Z INFO [connection] Opened connection >>> [connectionId{localValue:21, serverValue:789}] to 172.24.9.59:27017 >>> 2016-08-24T10:46:57.416Z INFO [connection] Opened connection >>> [connectionId{localValue:23, serverValue:791}] to 172.24.9.59:27017 >>> 2016-08-24T10:47:03.900Z INFO [jvm] >>> [graylog-1743aa7d-c19e-48fa-89fa-f2e619ec0692] [gc][young][13][9] duration >>> [1.5s], collections [2]/[2.4s], total [1.5s]/[6s], memory >>> [7.9gb]->[4.2gb]/[30.4gb], all_pools {[young] >>> [5.2gb]->[1.5gb]/[12.8gb]}{[survivor] [1.4gb]->[599.3mb]/[1.5gb]}{[old] >>> [1.1gb]->[2.9gb]/[16gb]} >>> 2016-08-24T10:47:08.949Z INFO [jvm] >>> [graylog-1743aa7d-c19e-48fa-89fa-f2e619ec0692] [gc][young][17][12] duration >>> [1.5s], collections [2]/[2s], total [1.5s]/[7.7s], memory >>> [13.1gb]->[5.8gb]/[30.4gb], all_pools {[young] >>> [8.9gb]->[910.8mb]/[12.8gb]}{[survivor] [1.2gb]->[102.5mb]/[1.5gb]}{[old] >>> [2.9gb]->[4.8gb]/[16gb]} >>> 2016-08-24T10:47:11.874Z INFO [connection] Opened connection >>> [connectionId{localValue:28, serverValue:794}] to 172.24.9.59:27017 >>> 2016-08-24T10:47:11.870Z INFO [connection] Opened connection >>> [connectionId{localValue:29, serverValue:797}] to 172.24.9.59:27017 >>> 2016-08-24T10:47:11.878Z INFO [connection] Opened connection >>> [connectionId{localValue:24, serverValue:793}] to 172.24.9.59:27017 >>> 2016-08-24T10:47:11.878Z INFO [connection] Opened connection >>> [connectionId{localValue:27, serverValue:796}] to 172.24.9.59:27017 >>> 2016-08-24T10:47:11.878Z INFO [connection] Opened connection >>> [connectionId{localValue:25, serverValue:792}] to 172.24.9.59:27017 >>> 2016-08-24T10:47:11.876Z INFO [connection] Opened connection >>> [connectionId{localValue:26, serverValue:795}] to 172.24.9.59:27017 >>> 2016-08-24T10:47:20.046Z WARN [jvm] >>> [graylog-1743aa7d-c19e-48fa-89fa-f2e619ec0692] [gc][young][25][19] duration >>> [2s], collections [2]/[2.4s], total [2s]/[12.5s], memory >>> [18.8gb]->[11.3gb]/[30.4gb], all_pools {[young] >>> [9.6gb]->[360.1mb]/[12.8gb]}{[survivor] [1.5gb]->[813.1mb]/[1.5gb]}{[old] >>> [7.6gb]->[10.2gb]/[16gb]} >>> 2016-08-24T10:47:25.065Z WARN [jvm] >>> [graylog-1743aa7d-c19e-48fa-89fa-f2e619ec0692] [gc][young][29][21]* >>> duration [1.3s]*, collections [1]/[2s], total [1.3s]/[14.1s], memory >>> [20.3gb]->[13.5gb]/[30.4gb], all_pools {[young] >>> [8.4gb]->[161.9mb]/[12.8gb]}{[survivor] [1.5gb]->[1.5gb]/[1.5gb]}{[old] >>> [10.2gb]->[11.8gb]/[16gb]} >>> >>> >>> >>> >>> >>> >>> *2016-08-24T10:48:02.521Z WARN [NodePingThread] Did not find meta info >>> of this node. Re-registering.nevertheless i had other runs when this >>> problem manifests it self:* >>> 2016-08-22T14:59:48.034Z INFO [jvm] >>> [graylog-1743aa7d-c19e-48fa-89fa-f2e619ec0692] [gc][young][62][18] duration >>> [797ms], collections [1]/[1.5s], total [797ms]/[18.1s], memory [ >>> 24.7gb]->[15gb]/[30.4gb], all_pools {[young] >>> [10.8gb]->[47.8mb]/[12.8gb]}{[survivor] [1.5gb]->[1.5gb]/[1.5gb]}{[old] >>> [12.3gb]->[13.3gb]/[16gb]} >>> >>> 2016-08-22T15:03:51.784Z WARN [KafkaJournal] Committing journal offset >>> spins 10 times now, this might be a bug. Continuing to try update. >>> >>> *2016-08-22T15:07:08.647Z WARN [jvm] >>> [graylog-1743aa7d-c19e-48fa-89fa-f2e619ec0692] [gc][old][477][7] duration >>> [23.9s], collections [1]/[24.7s], total [23.9s]/[25.8s], mem* >>> *ory [25.2gb]->[11.8gb]/[30.4gb], all_pools {[young] >>> [10.2gb]->[32.4mb]/[12.8gb]}{[survivor] [411.3mb]->[0b]/[1.5gb]}{[old] >>> [14.6gb]->[11.8gb]/[16gb]}* >>> >>> 2016-08-22T21:32:36.887Z WARN [NodePingThread] Did not find meta info >>> of this node. Re-registering. >>> 2016-08-22T21:33:15.752Z WARN [NodePingThread] Did not find meta info >>> of this node. Re-registering. >>> >>> *2016-08-22T21:33:15.752Z WARN [jvm] >>> [graylog-4e9a7285-48ce-468c-8604-6b2bf613eafd] [gc][old][501][37] duration >>> [38.6s], collections [2]/[38.8s], total [38.6s]/[15.8m], memory >>> [30.3gb]->[22.8gb]/[30.4gb], all_pools {[young] >>> [12.8gb]->[6.8gb]/[12.8gb]}{[survivor] [1.5gb]->[0b]/[1.5gb]}{[old] >>> [15.9gb]->[15.9gb]/[16gb]}* >>> >>> 2016-08-22T21:33:39.089Z WARN [NodePingThread] Did not find meta info >>> of this node. Re-registering. >>> 2016-08-22T21:34:27.644Z WARN [NodePingThread] Did not find meta info >>> of this node. Re-registering. >>> >>> >>> >>> So I tried to gather some GClogs also attached to the post, if i do a >>> jstat -gccause on one of the servers >>> this is what i get: >>> root@graylog1-cab1:~# jstat -gccause 2626 1000 >>> S0 S1 E O M CCS YGC YGCT FGC >>> FGCT GCT LGCC GCC >>> 0.00 7.66 85.47 89.72 98.31 - 132 27.511 7 >>> 26.812 54.323 Allocation Failure No GC >>> 0.00 7.66 85.50 89.72 98.31 - 132 27.511 7 >>> 26.812 54.323 Allocation Failure No GC >>> 0.00 7.66 85.54 89.72 98.31 - 132 27.511 7 >>> 26.812 54.323 Allocation Failure No GC >>> 0.00 7.66 85.55 89.72 98.31 - 132 27.511 7 >>> 26.812 54.323 Allocation Failure No GC >>> 0.00 7.66 85.55 89.72 98.31 - 132 27.511 7 >>> 26.812 54.323 Allocation Failure No GC >>> 0.00 7.66 85.61 89.72 98.31 - 132 27.511 7 >>> 26.812 54.323 Allocation Failure No GC >>> 0.00 7.66 85.65 89.72 98.31 - 132 27.511 7 >>> 26.812 54.323 Allocation Failure No GC >>> 0.00 7.66 85.68 89.72 98.31 - 132 27.511 7 >>> 26.812 54.323 Allocation Failure No GC >>> 0.00 7.66 85.69 89.72 98.31 - 132 27.511 7 >>> 26.812 54.323 Allocation Failure No GC >>> 0.00 7.66 85.72 89.72 98.31 - 132 27.511 7 >>> 26.812 54.323 Allocation Failure No GC >>> 0.00 7.66 85.75 89.72 98.31 - 132 27.511 7 >>> 26.812 54.323 Allocation Failure No GC >>> 0.00 7.66 85.75 89.72 98.31 - 132 27.511 7 >>> 26.812 54.323 Allocation Failure No GC >>> 0.00 7.66 85.84 89.72 98.31 - 132 27.511 7 >>> 26.812 54.323 Allocation Failure No GC >>> 0.00 7.66 85.87 89.72 98.31 - 132 27.511 7 >>> 26.812 54.323 Allocation Failure No GC >>> 0.00 7.66 85.93 89.72 98.31 - 132 27.511 7 >>> 26.812 54.323 Allocation Failure No GC >>> 0.00 7.66 85.93 89.72 98.31 - 132 27.511 7 >>> 26.812 54.323 Allocation Failure No GC >>> 0.00 7.66 85.97 89.72 98.31 - 132 27.511 7 >>> 26.812 54.323 Allocation Failure No GC >>> 0.00 7.66 86.00 89.72 98.31 - 132 27.511 7 >>> 26.812 54.323 Allocation Failure No GC >>> 0.00 7.66 86.00 89.72 98.31 - 132 27.511 7 >>> 26.812 54.323 Allocation Failure No GC >>> 0.00 7.66 86.07 89.72 98.31 - 132 27.511 7 >>> 26.812 54.323 Allocation Failure No GC >>> 0.00 7.66 86.11 89.72 98.31 - 132 27.511 7 >>> 26.812 54.323 Allocation Failure No GC >>> 0.00 7.66 86.17 89.72 98.31 - 132 27.511 7 >>> 26.812 54.323 Allocation Failure No GC >>> ^Croot@graylog1-cab1:~# >>> >>> root@graylog1-cab2:~# jstat -gccause 15247 1000 >>> S0 S1 E O M CCS YGC YGCT FGC >>> FGCT GCT LGCC GCC >>> 16.00 0.00 2.34 82.67 97.76 - 3956 209.064 42 >>> 8.939 218.002 Allocation Failure No GC >>> 16.00 0.00 49.51 82.67 97.76 - 3956 209.064 42 >>> 8.939 218.002 Allocation Failure No GC >>> 16.00 0.00 100.00 82.67 97.76 - 3957 209.064 42 >>> 8.939 218.002 Allocation Failure Allocation Failure >>> 0.00 16.63 46.85 82.83 97.76 - 3957 209.175 42 >>> 8.939 218.113 Allocation Failure No GC >>> 18.22 18.59 99.98 83.17 97.76 - 3958 209.175 42 >>> 8.939 218.113 No GC GCLocker Initiated GC >>> 18.22 0.00 52.19 83.01 97.76 - 3958 209.285 42 >>> 8.939 218.223 GCLocker Initiated GC No GC >>> 0.00 16.02 2.33 83.19 97.76 - 3959 209.324 42 >>> 8.939 218.263 Allocation Failure No GC >>> 0.00 16.02 50.38 83.19 97.76 - 3959 209.324 42 >>> 8.939 218.263 Allocation Failure No GC >>> 0.00 16.02 97.35 83.19 97.76 - 3959 209.324 42 >>> 8.939 218.263 Allocation Failure No GC >>> 16.15 0.00 46.86 83.36 97.76 - 3960 209.363 42 >>> 8.939 218.302 Allocation Failure No GC >>> 18.10 15.06 99.80 83.56 97.76 - 3961 209.363 42 >>> 8.939 218.302 No GC GCLocker Initiated GC >>> 0.00 18.95 50.47 83.54 97.76 - 3961 209.462 42 >>> 8.939 218.401 GCLocker Initiated GC No GC >>> 14.51 0.00 2.39 83.72 97.76 - 3962 209.503 42 >>> 8.939 218.441 Allocation Failure No GC >>> 14.51 0.00 56.86 83.72 97.76 - 3962 209.503 42 >>> 8.939 218.441 Allocation Failure No GC >>> 0.00 16.35 9.47 83.90 97.76 - 3963 209.542 42 >>> 8.939 218.480 Allocation Failure No GC >>> 0.00 16.35 63.29 83.90 97.76 - 3963 209.542 42 >>> 8.939 218.480 Allocation Failure No GC >>> 15.13 0.00 14.54 84.10 97.76 - 3964 209.582 42 >>> 8.939 218.521 Allocation Failure No GC >>> 15.13 0.00 66.56 84.10 97.76 - 3964 209.582 42 >>> 8.939 218.521 Allocation Failure No GC >>> 0.00 16.20 18.57 84.28 97.76 - 3965 209.622 42 >>> 8.939 218.561 Allocation Failure No GC >>> 0.00 16.20 72.64 84.28 97.76 - 3965 209.622 42 >>> 8.939 218.561 Allocation Failure No GC >>> 16.43 0.00 23.21 84.48 97.76 - 3966 209.660 42 >>> 8.939 218.599 Allocation Failure No GC >>> 16.43 0.00 72.23 84.48 97.76 - 3966 209.660 42 >>> 8.939 218.599 Allocation Failure No GC >>> 0.00 13.79 24.92 84.65 97.76 - 3967 209.705 42 >>> 8.939 218.643 Allocation Failure No GC >>> 0.00 13.79 79.81 84.65 97.76 - 3967 209.705 42 >>> 8.939 218.643 Allocation Failure No GC >>> 16.05 0.00 32.03 84.82 97.76 - 3968 209.742 42 >>> 8.939 218.681 Allocation Failure No GC >>> 16.05 0.00 86.74 84.82 97.76 - 3968 209.742 42 >>> 8.939 218.681 Allocation Failure No GC >>> 0.00 15.02 32.71 84.98 97.76 - 3969 209.782 42 >>> 8.939 218.720 Allocation Failure No GC >>> ^Croot@graylog1-cab2:~# >>> >>> >>> After seeing this error, I was thinking that probably the YOUNG ratio in >>> graylog was non optimal, but it looks like it is using a 1:1 ratio so it >>> has 16GB. >>> >>> At this point, I don't know what I could test, any ideas are welcome. >>> >>> Thanks >>> Ricardo >>> >>> -- You received this message because you are subscribed to the Google Groups "Graylog Users" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/graylog2/515dea95-fbfb-451e-9466-df3a9f3972ba%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.
