[graylog2] Re: Graylog Failing jvm Allocation Failure [jvm] [graylog-4e9a7285-48ce-468c-8604-6b2bf613eafd] [gc][old][501][37] duration [38.6s],

Ricardo Ferreira Mon, 12 Sep 2016 14:59:15 -0700

Just refreshing this issue, as when I try to use extractors/grok the node 
simply dies. analysing the GC logs I see there is a lot of full GC taking 
too much and pausing.
Has anyone encountered same issues for environments with more than 40Kmsg/s?


Thanks



On Wednesday, 24 August 2016 21:16:07 UTC+1, Ricardo Ferreira wrote:
>
> So  to give more evidence to the case I let the two nodes to run for a 
> while, attaching server.log and GC logs.
>
> I used GC1 in one of the node with similar results, seems like graylog is 
> not able to keep up more than 50K msg/s.
>
> I'm monitoring graylog with the metrics plugin and ES, so I'm able to 
> provide historical trends if needed be.
> ES does not seem to be the bottleneck nothing on logs nor in the 
> CPU/memory usage
>
> If you guys have any idea on how to proceed would be appreciated
>
> Thanks
> Ricardo
>
> On Wednesday, 24 August 2016 14:12:45 UTC+1, Jochen Schalanda wrote:
>>
>> Hi Ricardo,
>>
>> try configuring *less* heap memory for your JVM, ideally less than 32G. 
>> See 
>> https://blog.codecentric.de/en/2014/02/35gb-heap-less-32gb-java-jvm-memory-oddities/
>>  
>> for details.
>>
>> Cheers,
>> Jochen
>>
>> On Wednesday, 24 August 2016 15:02:10 UTC+2, Ricardo Ferreira wrote:
>>>
>>> So, we have a 2 identical node graylog setup:
>>>
>>>
>>>  ProLiant DL380 Gen9 (719061-B21) 
>>>  32: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
>>>  128GB RAM
>>>  20MiB L3 cache
>>>  2147479552 bytes (2.1 GB) copied, 12.8211 s, 167 MB/s
>>>
>>> 016-08-24T10:46:34.255Z INFO  [CmdLineTool] Running with JVM arguments: 
>>> -Xms32g -Xmx32g -XX:NewRatio=1 -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC 
>>> -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:+UseParNewGC 
>>> -XX:-OmitStackTraceInFastThrow -Xloggc:/tmp/memoryGC.log 
>>> -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps 
>>> -Dlog4j.configurationFile=file:///etc/graylog/server/log4j2.xml 
>>> -Djava.library.path=/usr/share/graylog-server/lib/sigar 
>>> -Dgraylog2.installation_source=deb
>>>
>>> different from default configs:
>>>
>>>
>>> elasticsearch_shards = 5
>>> elasticsearch_replicas = 1
>>>
>>> processbuffer_processors = 15
>>> outputbuffer_processors = 12
>>> ring_size = 4194304
>>> inputbuffer_ring_size = 4194304
>>> inputbuffer_processors = 6
>>> inputbuffer_wait_strategy = blocking
>>> message_journal_max_size = 30gb
>>> async_eventbus_processors = 4
>>>
>>> So our environment does ~ 30-50GB per hour in this testing phase we are 
>>> at ~5kmsg/s
>>> If you look at the graphs we started doing a run just into one done and 
>>> was changed to another node around 11:50UTC
>>>
>>> While it was running i noticed that the graylog node was de registering 
>>> himself:
>>>
>>> 2016-08-24T10:46:57.254Z WARN  [jvm] 
>>> [graylog-1743aa7d-c19e-48fa-89fa-f2e619ec0692] [gc][young][8][4] duration 
>>> [3.1s], collections [1]/[3.9s], total [3.1s]/[3.8s], memory 
>>> [9gb]->[2gb]/[30.4gb], all_pools {[young] 
>>> [7.8gb]->[445.1mb]/[12.8gb]}{[survivor] [1.2gb]->[886.4mb]/[1.5gb]}{[old] 
>>> [0b]->[770.9mb]/[16gb]}
>>> 2016-08-24T10:46:57.374Z INFO  [connection] Opened connection 
>>> [connectionId{localValue:22, serverValue:790}] to 172.24.9.59:27017
>>> 2016-08-24T10:46:57.402Z INFO  [connection] Opened connection 
>>> [connectionId{localValue:21, serverValue:789}] to 172.24.9.59:27017
>>> 2016-08-24T10:46:57.416Z INFO  [connection] Opened connection 
>>> [connectionId{localValue:23, serverValue:791}] to 172.24.9.59:27017
>>> 2016-08-24T10:47:03.900Z INFO  [jvm] 
>>> [graylog-1743aa7d-c19e-48fa-89fa-f2e619ec0692] [gc][young][13][9] duration 
>>> [1.5s], collections [2]/[2.4s], total [1.5s]/[6s], memory 
>>> [7.9gb]->[4.2gb]/[30.4gb], all_pools {[young] 
>>> [5.2gb]->[1.5gb]/[12.8gb]}{[survivor] [1.4gb]->[599.3mb]/[1.5gb]}{[old] 
>>> [1.1gb]->[2.9gb]/[16gb]}
>>> 2016-08-24T10:47:08.949Z INFO  [jvm] 
>>> [graylog-1743aa7d-c19e-48fa-89fa-f2e619ec0692] [gc][young][17][12] duration 
>>> [1.5s], collections [2]/[2s], total [1.5s]/[7.7s], memory 
>>> [13.1gb]->[5.8gb]/[30.4gb], all_pools {[young] 
>>> [8.9gb]->[910.8mb]/[12.8gb]}{[survivor] [1.2gb]->[102.5mb]/[1.5gb]}{[old] 
>>> [2.9gb]->[4.8gb]/[16gb]}
>>> 2016-08-24T10:47:11.874Z INFO  [connection] Opened connection 
>>> [connectionId{localValue:28, serverValue:794}] to 172.24.9.59:27017
>>> 2016-08-24T10:47:11.870Z INFO  [connection] Opened connection 
>>> [connectionId{localValue:29, serverValue:797}] to 172.24.9.59:27017
>>> 2016-08-24T10:47:11.878Z INFO  [connection] Opened connection 
>>> [connectionId{localValue:24, serverValue:793}] to 172.24.9.59:27017
>>> 2016-08-24T10:47:11.878Z INFO  [connection] Opened connection 
>>> [connectionId{localValue:27, serverValue:796}] to 172.24.9.59:27017
>>> 2016-08-24T10:47:11.878Z INFO  [connection] Opened connection 
>>> [connectionId{localValue:25, serverValue:792}] to 172.24.9.59:27017
>>> 2016-08-24T10:47:11.876Z INFO  [connection] Opened connection 
>>> [connectionId{localValue:26, serverValue:795}] to 172.24.9.59:27017
>>> 2016-08-24T10:47:20.046Z WARN  [jvm] 
>>> [graylog-1743aa7d-c19e-48fa-89fa-f2e619ec0692] [gc][young][25][19] duration 
>>> [2s], collections [2]/[2.4s], total [2s]/[12.5s], memory 
>>> [18.8gb]->[11.3gb]/[30.4gb], all_pools {[young] 
>>> [9.6gb]->[360.1mb]/[12.8gb]}{[survivor] [1.5gb]->[813.1mb]/[1.5gb]}{[old] 
>>> [7.6gb]->[10.2gb]/[16gb]}
>>> 2016-08-24T10:47:25.065Z WARN  [jvm] 
>>> [graylog-1743aa7d-c19e-48fa-89fa-f2e619ec0692] [gc][young][29][21]* 
>>> duration [1.3s]*, collections [1]/[2s], total [1.3s]/[14.1s], memory 
>>> [20.3gb]->[13.5gb]/[30.4gb], all_pools {[young] 
>>> [8.4gb]->[161.9mb]/[12.8gb]}{[survivor] [1.5gb]->[1.5gb]/[1.5gb]}{[old] 
>>> [10.2gb]->[11.8gb]/[16gb]}
>>>
>>>
>>>
>>>
>>>
>>>
>>> *2016-08-24T10:48:02.521Z WARN  [NodePingThread] Did not find meta info 
>>> of this node. Re-registering.nevertheless i had other runs when this 
>>> problem manifests it self:*
>>> 2016-08-22T14:59:48.034Z INFO  [jvm] 
>>> [graylog-1743aa7d-c19e-48fa-89fa-f2e619ec0692] [gc][young][62][18] duration 
>>> [797ms], collections [1]/[1.5s], total [797ms]/[18.1s], memory [
>>> 24.7gb]->[15gb]/[30.4gb], all_pools {[young] 
>>> [10.8gb]->[47.8mb]/[12.8gb]}{[survivor] [1.5gb]->[1.5gb]/[1.5gb]}{[old] 
>>> [12.3gb]->[13.3gb]/[16gb]}
>>>
>>> 2016-08-22T15:03:51.784Z WARN  [KafkaJournal] Committing journal offset 
>>> spins 10 times now, this might be a bug. Continuing to try update.
>>>
>>> *2016-08-22T15:07:08.647Z WARN  [jvm] 
>>> [graylog-1743aa7d-c19e-48fa-89fa-f2e619ec0692] [gc][old][477][7] duration 
>>> [23.9s], collections [1]/[24.7s], total [23.9s]/[25.8s], mem*
>>> *ory [25.2gb]->[11.8gb]/[30.4gb], all_pools {[young] 
>>> [10.2gb]->[32.4mb]/[12.8gb]}{[survivor] [411.3mb]->[0b]/[1.5gb]}{[old] 
>>> [14.6gb]->[11.8gb]/[16gb]}*
>>>
>>> 2016-08-22T21:32:36.887Z WARN  [NodePingThread] Did not find meta info 
>>> of this node. Re-registering.
>>> 2016-08-22T21:33:15.752Z WARN  [NodePingThread] Did not find meta info 
>>> of this node. Re-registering.
>>>
>>> *2016-08-22T21:33:15.752Z WARN  [jvm] 
>>> [graylog-4e9a7285-48ce-468c-8604-6b2bf613eafd] [gc][old][501][37] duration 
>>> [38.6s], collections [2]/[38.8s], total [38.6s]/[15.8m], memory 
>>> [30.3gb]->[22.8gb]/[30.4gb], all_pools {[young] 
>>> [12.8gb]->[6.8gb]/[12.8gb]}{[survivor] [1.5gb]->[0b]/[1.5gb]}{[old] 
>>> [15.9gb]->[15.9gb]/[16gb]}*
>>>
>>> 2016-08-22T21:33:39.089Z WARN  [NodePingThread] Did not find meta info 
>>> of this node. Re-registering.
>>> 2016-08-22T21:34:27.644Z WARN  [NodePingThread] Did not find meta info 
>>> of this node. Re-registering.
>>>
>>>
>>>
>>> So I tried to gather some GClogs also attached to the post, if i do a 
>>> jstat -gccause on one of the servers
>>> this is what i get:
>>> root@graylog1-cab1:~# jstat -gccause 2626 1000
>>>   S0     S1     E      O      M     CCS    YGC     YGCT    FGC    
>>> FGCT     GCT    LGCC                 GCC
>>>   0.00   7.66  85.47  89.72  98.31      -    132   27.511     7   
>>> 26.812   54.323 Allocation Failure   No GC
>>>   0.00   7.66  85.50  89.72  98.31      -    132   27.511     7   
>>> 26.812   54.323 Allocation Failure   No GC
>>>   0.00   7.66  85.54  89.72  98.31      -    132   27.511     7   
>>> 26.812   54.323 Allocation Failure   No GC
>>>   0.00   7.66  85.55  89.72  98.31      -    132   27.511     7   
>>> 26.812   54.323 Allocation Failure   No GC
>>>   0.00   7.66  85.55  89.72  98.31      -    132   27.511     7   
>>> 26.812   54.323 Allocation Failure   No GC
>>>   0.00   7.66  85.61  89.72  98.31      -    132   27.511     7   
>>> 26.812   54.323 Allocation Failure   No GC
>>>   0.00   7.66  85.65  89.72  98.31      -    132   27.511     7   
>>> 26.812   54.323 Allocation Failure   No GC
>>>   0.00   7.66  85.68  89.72  98.31      -    132   27.511     7   
>>> 26.812   54.323 Allocation Failure   No GC
>>>   0.00   7.66  85.69  89.72  98.31      -    132   27.511     7   
>>> 26.812   54.323 Allocation Failure   No GC
>>>   0.00   7.66  85.72  89.72  98.31      -    132   27.511     7   
>>> 26.812   54.323 Allocation Failure   No GC
>>>   0.00   7.66  85.75  89.72  98.31      -    132   27.511     7   
>>> 26.812   54.323 Allocation Failure   No GC
>>>   0.00   7.66  85.75  89.72  98.31      -    132   27.511     7   
>>> 26.812   54.323 Allocation Failure   No GC
>>>   0.00   7.66  85.84  89.72  98.31      -    132   27.511     7   
>>> 26.812   54.323 Allocation Failure   No GC
>>>   0.00   7.66  85.87  89.72  98.31      -    132   27.511     7   
>>> 26.812   54.323 Allocation Failure   No GC
>>>   0.00   7.66  85.93  89.72  98.31      -    132   27.511     7   
>>> 26.812   54.323 Allocation Failure   No GC
>>>   0.00   7.66  85.93  89.72  98.31      -    132   27.511     7   
>>> 26.812   54.323 Allocation Failure   No GC
>>>   0.00   7.66  85.97  89.72  98.31      -    132   27.511     7   
>>> 26.812   54.323 Allocation Failure   No GC
>>>   0.00   7.66  86.00  89.72  98.31      -    132   27.511     7   
>>> 26.812   54.323 Allocation Failure   No GC
>>>   0.00   7.66  86.00  89.72  98.31      -    132   27.511     7   
>>> 26.812   54.323 Allocation Failure   No GC
>>>   0.00   7.66  86.07  89.72  98.31      -    132   27.511     7   
>>> 26.812   54.323 Allocation Failure   No GC
>>>   0.00   7.66  86.11  89.72  98.31      -    132   27.511     7   
>>> 26.812   54.323 Allocation Failure   No GC
>>>   0.00   7.66  86.17  89.72  98.31      -    132   27.511     7   
>>> 26.812   54.323 Allocation Failure   No GC
>>> ^Croot@graylog1-cab1:~#
>>>
>>> root@graylog1-cab2:~# jstat -gccause 15247 1000
>>>   S0     S1     E      O      M     CCS    YGC     YGCT    FGC    
>>> FGCT     GCT    LGCC                 GCC
>>>  16.00   0.00   2.34  82.67  97.76      -   3956  209.064    42    
>>> 8.939  218.002 Allocation Failure   No GC
>>>  16.00   0.00  49.51  82.67  97.76      -   3956  209.064    42    
>>> 8.939  218.002 Allocation Failure   No GC
>>>  16.00   0.00 100.00  82.67  97.76      -   3957  209.064    42    
>>> 8.939  218.002 Allocation Failure   Allocation Failure
>>>   0.00  16.63  46.85  82.83  97.76      -   3957  209.175    42    
>>> 8.939  218.113 Allocation Failure   No GC
>>>  18.22  18.59  99.98  83.17  97.76      -   3958  209.175    42    
>>> 8.939  218.113 No GC                GCLocker Initiated GC
>>>  18.22   0.00  52.19  83.01  97.76      -   3958  209.285    42    
>>> 8.939  218.223 GCLocker Initiated GC No GC
>>>   0.00  16.02   2.33  83.19  97.76      -   3959  209.324    42    
>>> 8.939  218.263 Allocation Failure   No GC
>>>   0.00  16.02  50.38  83.19  97.76      -   3959  209.324    42    
>>> 8.939  218.263 Allocation Failure   No GC
>>>   0.00  16.02  97.35  83.19  97.76      -   3959  209.324    42    
>>> 8.939  218.263 Allocation Failure   No GC
>>>  16.15   0.00  46.86  83.36  97.76      -   3960  209.363    42    
>>> 8.939  218.302 Allocation Failure   No GC
>>>  18.10  15.06  99.80  83.56  97.76      -   3961  209.363    42    
>>> 8.939  218.302 No GC                GCLocker Initiated GC
>>>   0.00  18.95  50.47  83.54  97.76      -   3961  209.462    42    
>>> 8.939  218.401 GCLocker Initiated GC No GC
>>>  14.51   0.00   2.39  83.72  97.76      -   3962  209.503    42    
>>> 8.939  218.441 Allocation Failure   No GC
>>>  14.51   0.00  56.86  83.72  97.76      -   3962  209.503    42    
>>> 8.939  218.441 Allocation Failure   No GC
>>>   0.00  16.35   9.47  83.90  97.76      -   3963  209.542    42    
>>> 8.939  218.480 Allocation Failure   No GC
>>>   0.00  16.35  63.29  83.90  97.76      -   3963  209.542    42    
>>> 8.939  218.480 Allocation Failure   No GC
>>>  15.13   0.00  14.54  84.10  97.76      -   3964  209.582    42    
>>> 8.939  218.521 Allocation Failure   No GC
>>>  15.13   0.00  66.56  84.10  97.76      -   3964  209.582    42    
>>> 8.939  218.521 Allocation Failure   No GC
>>>   0.00  16.20  18.57  84.28  97.76      -   3965  209.622    42    
>>> 8.939  218.561 Allocation Failure   No GC
>>>   0.00  16.20  72.64  84.28  97.76      -   3965  209.622    42    
>>> 8.939  218.561 Allocation Failure   No GC
>>>  16.43   0.00  23.21  84.48  97.76      -   3966  209.660    42    
>>> 8.939  218.599 Allocation Failure   No GC
>>>  16.43   0.00  72.23  84.48  97.76      -   3966  209.660    42    
>>> 8.939  218.599 Allocation Failure   No GC
>>>   0.00  13.79  24.92  84.65  97.76      -   3967  209.705    42    
>>> 8.939  218.643 Allocation Failure   No GC
>>>   0.00  13.79  79.81  84.65  97.76      -   3967  209.705    42    
>>> 8.939  218.643 Allocation Failure   No GC
>>>  16.05   0.00  32.03  84.82  97.76      -   3968  209.742    42    
>>> 8.939  218.681 Allocation Failure   No GC
>>>  16.05   0.00  86.74  84.82  97.76      -   3968  209.742    42    
>>> 8.939  218.681 Allocation Failure   No GC
>>>   0.00  15.02  32.71  84.98  97.76      -   3969  209.782    42    
>>> 8.939  218.720 Allocation Failure   No GC
>>> ^Croot@graylog1-cab2:~#
>>>
>>>
>>> After seeing this error, I was thinking that probably the YOUNG ratio in 
>>> graylog was non optimal, but it looks like it is using a 1:1 ratio so it 
>>> has 16GB.
>>>
>>> At this point, I don't know what I could test, any ideas are welcome.
>>>
>>> Thanks
>>> Ricardo
>>>
>>>

-- 
You received this message because you are subscribed to the Google Groups 
"Graylog Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/graylog2/515dea95-fbfb-451e-9466-df3a9f3972ba%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

[graylog2] Re: Graylog Failing jvm Allocation Failure [jvm] [graylog-4e9a7285-48ce-468c-8604-6b2bf613eafd] [gc][old][501][37] duration [38.6s],

Reply via email to