[graylog2] Re: 20.1 - New feature reveals Rejected Execution Queues

Scotty H Tue, 25 Feb 2014 10:18:18 -0800

Now, how do I clear this indexer failure log?

On Tuesday, February 25, 2014 1:14:27 PM UTC-5, Scotty H wrote:
>
> The indexer failures section is a welcome addition. It greeted me with 
> tens of thousands of these messages (upwards of 50K after about 30 
> minutes):
>
>> RemoteTransportException[[Suicide][inet[/1[ipaddress]:9300]][bulk/shard]]; 
>> nested: EsRejectedExecutionException[rejected execution (queue capacity 50) on 
>> org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1@2b5b09fb];
>
>
> At any given time, my Graylog2 box is processing anywhere from 2,500 to 
> 5,500 messages/sec, with occasional spikes of 7K/sec. Right now I have 3.4 
> billion messages, totaling about 1.8 TB.
> I increased my shard count from 4 to 25, restarted, and cycled the 
> deflector, but that didn't seem to help. I found a thread discussing thread 
> count and queue size increases and decided to try that:
>
> http://elasticsearch-users.115913.n3.nabble.com/Understanding-Threadpools-td4028445.html
>
> So here are my custom Elasticsearch performance settings from my 
> Elasticsearch configuration (NOT graylog2's configuration). Some of these 
> are not really needed, but I have so much memory to work with that it 
> doesn't matter:
>
>> indices.memory.index_buffer_size: 30%
>> indices.memory.min_shard_index_buffer_size: 12mb
>> indices.memory.min_index_buffer_size: 96mb
>> index.refresh_interval: 30s
>> index.translog.flush_threshold_ops: 5000
>> threadpool.bulk.queue_size: 500
>
>
>
> The relevant change to stop the rejections was increasing my threadpool 
> bulk queue_size to 500. The default is 50. I was still getting an 
> occasional queue-full rejection at 200. I could set it to -1 to make it 
> unbounded, but I feel like that's not a good practice. The bulk tasks seem 
> to complete within milliseconds, but enough of them are submitted at the 
> same time to fill up the queue, and 50 is simply too small.
>
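A back-of-the-envelope sketch of why a burst overflows a small queue even when each bulk task finishes within milliseconds. This is a simplified model of a fixed-size thread pool with a bounded queue, not Elasticsearch's actual implementation. The function name and the burst size of 300 are made up for illustration; the 16 bulk threads and the queue sizes 50, 200, 500 and -1 are the figures from this thread.

```python
def rejected_in_burst(burst, workers, queue_capacity):
    """Count tasks rejected when `burst` tasks arrive at the same instant.

    Simplified model (an assumption, not Elasticsearch's code): each idle
    worker takes one task immediately, the bounded queue holds up to
    `queue_capacity` more, and everything beyond that is rejected.
    A queue_capacity of -1 means the queue is unbounded.
    """
    if queue_capacity < 0:
        return 0
    return max(0, burst - workers - queue_capacity)


if __name__ == "__main__":
    # 16 bulk threads as reported in the stats; the burst of 300 is hypothetical.
    for capacity in (50, 200, 500, -1):
        print(capacity, rejected_in_burst(300, 16, capacity))
```

In this model the only things that matter are how many tasks arrive together and how much room the pool has, which would explain why rejections can persist at 200 and disappear at 500 without any change in per-task speed.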
> After about 15 minutes, here are my cluster stats, where I happened to 
> capture some active bulk queues. A quarter of a second later, the queues 
> were empty again:
>
>> http://ipaddress:9200/_nodes/thread_pool/stats?pretty=true
>>
>> {
>>   "cluster_name" : "graylog2",
>>   "nodes" : {
>>     "c-9rpgQTQI68r91PicxmzA" : {
>>       "timestamp" : 1393351331175,
>>       "name" : "graylog2-server",
>>       "transport_address" : "inet[/XXXXXXXX]",
>>       "hostname" : "XXXXXXXXXX",
>>       "attributes" : {
>>         "client" : "true",
>>         "data" : "false",
>>         "master" : "false"
>>       },
>>       "thread_pool" : {
>>         "generic" : {
>>           "threads" : 1,
>>           "queue" : 0,
>>           "active" : 0,
>>           "rejected" : 0,
>>           "largest" : 4,
>>           "completed" : 258
>>         },
>>         "index" : {
>>           "threads" : 0,
>>           "queue" : 0,
>>           "active" : 0,
>>           "rejected" : 0,
>>           "largest" : 0,
>>           "completed" : 0
>>         },
>>         "get" : {
>>           "threads" : 0,
>>           "queue" : 0,
>>           "active" : 0,
>>           "rejected" : 0,
>>           "largest" : 0,
>>           "completed" : 0
>>         },
>>         "snapshot" : {
>>           "threads" : 0,
>>           "queue" : 0,
>>           "active" : 0,
>>           "rejected" : 0,
>>           "largest" : 0,
>>           "completed" : 0
>>         },
>>         "merge" : {
>>           "threads" : 0,
>>           "queue" : 0,
>>           "active" : 0,
>>           "rejected" : 0,
>>           "largest" : 0,
>>           "completed" : 0
>>         },
>>         "suggest" : {
>>           "threads" : 0,
>>           "queue" : 0,
>>           "active" : 0,
>>           "rejected" : 0,
>>           "largest" : 0,
>>           "completed" : 0
>>         },
>>         "bulk" : {
>>           "threads" : 0,
>>           "queue" : 0,
>>           "active" : 0,
>>           "rejected" : 0,
>>           "largest" : 0,
>>           "completed" : 0
>>         },
>>         "optimize" : {
>>           "threads" : 0,
>>           "queue" : 0,
>>           "active" : 0,
>>           "rejected" : 0,
>>           "largest" : 0,
>>           "completed" : 0
>>         },
>>         "warmer" : {
>>           "threads" : 0,
>>           "queue" : 0,
>>           "active" : 0,
>>           "rejected" : 0,
>>           "largest" : 0,
>>           "completed" : 0
>>         },
>>         "flush" : {
>>           "threads" : 0,
>>           "queue" : 0,
>>           "active" : 0,
>>           "rejected" : 0,
>>           "largest" : 0,
>>           "completed" : 0
>>         },
>>         "search" : {
>>           "threads" : 0,
>>           "queue" : 0,
>>           "active" : 0,
>>           "rejected" : 0,
>>           "largest" : 0,
>>           "completed" : 0
>>         },
>>         "percolate" : {
>>           "threads" : 0,
>>           "queue" : 0,
>>           "active" : 0,
>>           "rejected" : 0,
>>           "largest" : 0,
>>           "completed" : 0
>>         },
>>         "management" : {
>>           "threads" : 2,
>>           "queue" : 0,
>>           "active" : 1,
>>           "rejected" : 0,
>>           "largest" : 2,
>>           "completed" : 675
>>         },
>>         "refresh" : {
>>           "threads" : 0,
>>           "queue" : 0,
>>           "active" : 0,
>>           "rejected" : 0,
>>           "largest" : 0,
>>           "completed" : 0
>>         }
>>       }
>>     },
>>     "LIVHf3iGSWuRnOjaA74UPA" : {
>>       "timestamp" : 1393351331174,
>>       "name" : "Drake, Frank",
>>       "transport_address" : "inet[/XXXXXXXXX]",
>>       "hostname" : "XXXXXXXXX",
>>       "attributes" : {
>>         "master" : "true"
>>       },
>>       "thread_pool" : {
>>         "generic" : {
>>           "threads" : 4,
>>           "queue" : 0,
>>           "active" : 0,
>>           "rejected" : 0,
>>           "largest" : 7,
>>           "completed" : 3142
>>         },
>>         "index" : {
>>           "threads" : 0,
>>           "queue" : 0,
>>           "active" : 0,
>>           "rejected" : 0,
>>           "largest" : 0,
>>           "completed" : 0
>>         },
>>         "get" : {
>>           "threads" : 0,
>>           "queue" : 0,
>>           "active" : 0,
>>           "rejected" : 0,
>>           "largest" : 0,
>>           "completed" : 0
>>         },
>>         "snapshot" : {
>>           "threads" : 5,
>>           "queue" : 3,
>>           "active" : 5,
>>           "rejected" : 0,
>>           "largest" : 5,
>>           "completed" : 3452
>>         },
>>         "merge" : {
>>           "threads" : 5,
>>           "queue" : 0,
>>           "active" : 0,
>>           "rejected" : 0,
>>           "largest" : 5,
>>           "completed" : 17338
>>         },
>>         "suggest" : {
>>           "threads" : 0,
>>           "queue" : 0,
>>           "active" : 0,
>>           "rejected" : 0,
>>           "largest" : 0,
>>           "completed" : 0
>>         },
>>         "bulk" : {
>>           "threads" : 16,
>>           "queue" : 123,
>>           "active" : 15,
>>           "rejected" : 0,
>>           "largest" : 16,
>>           "completed" : 1649520
>>         },
>>         "optimize" : {
>>           "threads" : 0,
>>           "queue" : 0,
>>           "active" : 0,
>>           "rejected" : 0,
>>           "largest" : 0,
>>           "completed" : 0
>>         },
>>         "warmer" : {
>>           "threads" : 4,
>>           "queue" : 0,
>>           "active" : 0,
>>           "rejected" : 0,
>>           "largest" : 4,
>>           "completed" : 3972
>>         },
>>         "flush" : {
>>           "threads" : 3,
>>           "queue" : 0,
>>           "active" : 0,
>>           "rejected" : 0,
>>           "largest" : 3,
>>           "completed" : 325
>>         },
>>         "search" : {
>>           "threads" : 48,
>>           "queue" : 0,
>>           "active" : 0,
>>           "rejected" : 0,
>>           "largest" : 48,
>>           "completed" : 8355
>>         },
>>         "percolate" : {
>>           "threads" : 0,
>>           "queue" : 0,
>>           "active" : 0,
>>           "rejected" : 0,
>>           "largest" : 0,
>>           "completed" : 0
>>         },
>>         "management" : {
>>           "threads" : 5,
>>           "queue" : 0,
>>           "active" : 1,
>>           "rejected" : 0,
>>           "largest" : 5,
>>           "completed" : 413367
>>         },
>>         "refresh" : {
>>           "threads" : 3,
>>           "queue" : 0,
>>           "active" : 0,
>>           "rejected" : 0,
>>           "largest" : 3,
>>           "completed" : 575
>>         }
>>       }
>>     }
>>   }
>> }
>>
>>
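Rather than reading the whole response by eye, something like the following could pick out the pools that have queued or rejected tasks. It is only a sketch: the function name `busy_pools` is made up, and the sample is a trimmed copy of the response above (two pools from the data node), assuming the JSON layout shown there.

```python
import json


def busy_pools(stats):
    """Return (node name, pool name, queue, rejected) for every thread pool
    that currently has queued tasks or has rejected any."""
    found = []
    for node in stats.get("nodes", {}).values():
        for pool_name, pool in node.get("thread_pool", {}).items():
            queue = pool.get("queue", 0)
            rejected = pool.get("rejected", 0)
            if queue > 0 or rejected > 0:
                found.append((node.get("name"), pool_name, queue, rejected))
    return found


# Trimmed copy of the thread pool stats response (data node, two pools only).
SAMPLE = json.loads("""
{
  "cluster_name": "graylog2",
  "nodes": {
    "LIVHf3iGSWuRnOjaA74UPA": {
      "name": "Drake, Frank",
      "thread_pool": {
        "bulk": {"threads": 16, "queue": 123, "active": 15, "rejected": 0,
                 "largest": 16, "completed": 1649520},
        "search": {"threads": 48, "queue": 0, "active": 0, "rejected": 0,
                   "largest": 48, "completed": 8355}
      }
    }
  }
}
""")

if __name__ == "__main__":
    for node_name, pool_name, queue, rejected in busy_pools(SAMPLE):
        print(f"{node_name}: {pool_name} queue={queue} rejected={rejected}")
```

To run it against a live node, the response body of the stats URL can be passed through `json.loads` in place of `SAMPLE`.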
> Great feature. I was apparently losing messages because of an untuned 
> Elasticsearch, and I didn't even know it until this revealed the problem.
>
>


