[graylog2] 20.1 - New feature reveals Rejected Execution Queues

Scotty H Tue, 25 Feb 2014 10:14:53 -0800

The indexer failures section is a welcome addition. It greets me with 
quite a few thousand of these messages (upwards of 50K after about 30 
minutes):


> RemoteTransportException[[Suicide][inet[/1[ipaddress]:9300]][bulk/shard]]; 
> nested: EsRejectedExecutionException[rejected execution (queue capacity 50) 
> on 
> org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1@2b5b09fb];


At any given time, my Graylog2 box is processing anywhere from 2,500 to 
5,500 messages/sec, with occasional spikes of 7K/sec. Right now I have 3.4 
billion messages, totaling about 1.8TB.
I increased my shard count from 4 to 25, restarted, and cycled the 
deflector, but that didn't seem to help. I then found a thread discussing 
thread count and queue size increases and decided to try that:
http://elasticsearch-users.115913.n3.nabble.com/Understanding-Threadpools-td4028445.html

So here are my custom elasticsearch performance settings from my 
elasticsearch configuration (NOT graylog2's configuration). Some of these 
are not really needed, but I have so much memory to work with that it 
doesn't matter:

> indices.memory.index_buffer_size: 30%
> indices.memory.min_shard_index_buffer_size: 12mb
> indices.memory.min_index_buffer_size: 96mb
> index.refresh_interval: 30s
> index.translog.flush_threshold_ops: 5000
> threadpool.bulk.queue_size: 500



The relevant change to stop the rejections was increasing my threadpool 
bulk queue_size to 500. The default is 50. I was still getting an 
occasional queue-full rejection at 200. I could set it to -1 to make it 
unbounded, but I feel like that's not good practice. The bulk tasks seem 
to complete within milliseconds, but enough of them are submitted at the 
same time to fill up the queue, and 50 is simply too small.
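
To illustrate why short tasks can still overflow a small queue, here is a 
toy sketch in Python. It is not how elasticsearch implements its thread 
pools; it only models a fixed set of workers with a bounded queue in front 
of them. The burst size (300), the gap between bursts (25 ticks), and the 
function name are made-up assumptions, and only the 16 workers and the 
queue sizes 50/200/500 come from this post.

```python
def simulate_bounded_pool(queue_size, workers, burst_size, bursts, gap_ticks):
    """Toy model: every task takes exactly one tick on one worker.

    A burst of `burst_size` tasks arrives all at once, then the pool
    gets `gap_ticks` ticks to drain before the next burst arrives.
    """
    active = 0      # tasks currently running on a worker
    queued = 0      # tasks waiting in the bounded queue
    rejected = 0    # tasks dropped because the queue was full
    completed = 0

    for _ in range(bursts):
        # Burst arrives: fill idle workers first, then the queue, else reject.
        for _ in range(burst_size):
            if active < workers:
                active += 1
            elif queued < queue_size:
                queued += 1
            else:
                rejected += 1

        # Drain phase: running tasks finish, workers pull from the queue.
        for _ in range(gap_ticks):
            completed += active
            active = min(workers, queued)
            queued -= active

    return {"completed": completed, "rejected": rejected,
            "unfinished": active + queued}


# Hypothetical load: 10 bursts of 300 tasks, 16 workers, 25 ticks between bursts.
for size in (50, 200, 500):
    print(size, simulate_bounded_pool(size, workers=16, burst_size=300,
                                      bursts=10, gap_ticks=25))
```

In this model the workers can comfortably finish a whole burst before the 
next one arrives, so the rejections at the smaller queue sizes come purely 
from how many tasks show up at the same instant, not from the pool being 
too slow overall.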

After about 15 minutes, here are my cluster stats, where I happened to 
capture some active bulk queues. A quarter of a second later, the queues 
were empty again:


 

> http://ipaddress:9200/_nodes/thread_pool/stats?pretty=true
>
> {
>   "cluster_name" : "graylog2",
>   "nodes" : {
>     "c-9rpgQTQI68r91PicxmzA" : {
>       "timestamp" : 1393351331175,
>       "name" : "graylog2-server",
>       "transport_address" : "inet[/XXXXXXXX]",
>       "hostname" : "XXXXXXXXXX",
>       "attributes" : {
>         "client" : "true",
>         "data" : "false",
>         "master" : "false"
>       },
>       "thread_pool" : {
>         "generic" : {
>           "threads" : 1,
>           "queue" : 0,
>           "active" : 0,
>           "rejected" : 0,
>           "largest" : 4,
>           "completed" : 258
>         },
>         "index" : {
>           "threads" : 0,
>           "queue" : 0,
>           "active" : 0,
>           "rejected" : 0,
>           "largest" : 0,
>           "completed" : 0
>         },
>         "get" : {
>           "threads" : 0,
>           "queue" : 0,
>           "active" : 0,
>           "rejected" : 0,
>           "largest" : 0,
>           "completed" : 0
>         },
>         "snapshot" : {
>           "threads" : 0,
>           "queue" : 0,
>           "active" : 0,
>           "rejected" : 0,
>           "largest" : 0,
>           "completed" : 0
>         },
>         "merge" : {
>           "threads" : 0,
>           "queue" : 0,
>           "active" : 0,
>           "rejected" : 0,
>           "largest" : 0,
>           "completed" : 0
>         },
>         "suggest" : {
>           "threads" : 0,
>           "queue" : 0,
>           "active" : 0,
>           "rejected" : 0,
>           "largest" : 0,
>           "completed" : 0
>         },
>         "bulk" : {
>           "threads" : 0,
>           "queue" : 0,
>           "active" : 0,
>           "rejected" : 0,
>           "largest" : 0,
>           "completed" : 0
>         },
>         "optimize" : {
>           "threads" : 0,
>           "queue" : 0,
>           "active" : 0,
>           "rejected" : 0,
>           "largest" : 0,
>           "completed" : 0
>         },
>         "warmer" : {
>           "threads" : 0,
>           "queue" : 0,
>           "active" : 0,
>           "rejected" : 0,
>           "largest" : 0,
>           "completed" : 0
>         },
>         "flush" : {
>           "threads" : 0,
>           "queue" : 0,
>           "active" : 0,
>           "rejected" : 0,
>           "largest" : 0,
>           "completed" : 0
>         },
>         "search" : {
>           "threads" : 0,
>           "queue" : 0,
>           "active" : 0,
>           "rejected" : 0,
>           "largest" : 0,
>           "completed" : 0
>         },
>         "percolate" : {
>           "threads" : 0,
>           "queue" : 0,
>           "active" : 0,
>           "rejected" : 0,
>           "largest" : 0,
>           "completed" : 0
>         },
>         "management" : {
>           "threads" : 2,
>           "queue" : 0,
>           "active" : 1,
>           "rejected" : 0,
>           "largest" : 2,
>           "completed" : 675
>         },
>         "refresh" : {
>           "threads" : 0,
>           "queue" : 0,
>           "active" : 0,
>           "rejected" : 0,
>           "largest" : 0,
>           "completed" : 0
>         }
>       }
>     },
>     "LIVHf3iGSWuRnOjaA74UPA" : {
>       "timestamp" : 1393351331174,
>       "name" : "Drake, Frank",
>       "transport_address" : "inet[/XXXXXXXXX]",
>       "hostname" : "XXXXXXXXX",
>       "attributes" : {
>         "master" : "true"
>       },
>       "thread_pool" : {
>         "generic" : {
>           "threads" : 4,
>           "queue" : 0,
>           "active" : 0,
>           "rejected" : 0,
>           "largest" : 7,
>           "completed" : 3142
>         },
>         "index" : {
>           "threads" : 0,
>           "queue" : 0,
>           "active" : 0,
>           "rejected" : 0,
>           "largest" : 0,
>           "completed" : 0
>         },
>         "get" : {
>           "threads" : 0,
>           "queue" : 0,
>           "active" : 0,
>           "rejected" : 0,
>           "largest" : 0,
>           "completed" : 0
>         },
>         "snapshot" : {
>           "threads" : 5,
>           "queue" : 3,
>           "active" : 5,
>           "rejected" : 0,
>           "largest" : 5,
>           "completed" : 3452
>         },
>         "merge" : {
>           "threads" : 5,
>           "queue" : 0,
>           "active" : 0,
>           "rejected" : 0,
>           "largest" : 5,
>           "completed" : 17338
>         },
>         "suggest" : {
>           "threads" : 0,
>           "queue" : 0,
>           "active" : 0,
>           "rejected" : 0,
>           "largest" : 0,
>           "completed" : 0
>         },
>         "bulk" : {
>           "threads" : 16,
>           "queue" : 123,
>           "active" : 15,
>           "rejected" : 0,
>           "largest" : 16,
>           "completed" : 1649520
>         },
>         "optimize" : {
>           "threads" : 0,
>           "queue" : 0,
>           "active" : 0,
>           "rejected" : 0,
>           "largest" : 0,
>           "completed" : 0
>         },
>         "warmer" : {
>           "threads" : 4,
>           "queue" : 0,
>           "active" : 0,
>           "rejected" : 0,
>           "largest" : 4,
>           "completed" : 3972
>         },
>         "flush" : {
>           "threads" : 3,
>           "queue" : 0,
>           "active" : 0,
>           "rejected" : 0,
>           "largest" : 3,
>           "completed" : 325
>         },
>         "search" : {
>           "threads" : 48,
>           "queue" : 0,
>           "active" : 0,
>           "rejected" : 0,
>           "largest" : 48,
>           "completed" : 8355
>         },
>         "percolate" : {
>           "threads" : 0,
>           "queue" : 0,
>           "active" : 0,
>           "rejected" : 0,
>           "largest" : 0,
>           "completed" : 0
>         },
>         "management" : {
>           "threads" : 5,
>           "queue" : 0,
>           "active" : 1,
>           "rejected" : 0,
>           "largest" : 5,
>           "completed" : 413367
>         },
>         "refresh" : {
>           "threads" : 3,
>           "queue" : 0,
>           "active" : 0,
>           "rejected" : 0,
>           "largest" : 3,
>           "completed" : 575
>         }
>       }
>     }
>   }
> }
>
>
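
If you want to watch just the bulk pool rather than read the whole dump, 
something like the following Python sketch could work. It assumes the 
response has the same shape as the output above. The function names and 
the trimmed SAMPLE dict are my own, the numbers in SAMPLE are copied from 
the stats above, and fetch_stats() is untested against a live cluster.

```python
import json
import urllib.request


def summarize_pool(stats, pool="bulk"):
    """Map each node name to the threads/active/queue/rejected of one pool."""
    summary = {}
    for node in stats.get("nodes", {}).values():
        counters = node.get("thread_pool", {}).get(pool)
        if counters is None:
            continue  # this node does not report the requested pool
        summary[node["name"]] = {
            key: counters[key]
            for key in ("threads", "active", "queue", "rejected")
        }
    return summary


def fetch_stats(base_url):
    """Fetch thread pool stats; the path is the one used in this post."""
    with urllib.request.urlopen(base_url + "/_nodes/thread_pool/stats") as resp:
        return json.load(resp)


# Trimmed copy of the output above, keeping only the bulk pool of each node.
SAMPLE = {
    "cluster_name": "graylog2",
    "nodes": {
        "c-9rpgQTQI68r91PicxmzA": {
            "name": "graylog2-server",
            "thread_pool": {"bulk": {"threads": 0, "queue": 0, "active": 0,
                                     "rejected": 0, "largest": 0,
                                     "completed": 0}},
        },
        "LIVHf3iGSWuRnOjaA74UPA": {
            "name": "Drake, Frank",
            "thread_pool": {"bulk": {"threads": 16, "queue": 123, "active": 15,
                                     "rejected": 0, "largest": 16,
                                     "completed": 1649520}},
        },
    },
}

for name, counters in summarize_pool(SAMPLE).items():
    print(name, counters)
```

Against a live node you would call summarize_pool(fetch_stats(...)) with 
your own host and port in a loop. A "rejected" counter that keeps climbing 
is the thing to look for.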
Great feature. I was apparently losing messages because of an untuned 
elasticsearch, and I didn't even know it until this revealed the problem.

