I mean they aren't logging anything (until I send the shutdown command, a 
node leaves, etc.). It's not that I feel there's an issue with the logging; 
the data nodes just have nothing to log because everything seems fine to 
them. I have attached a log from one of the data nodes showing this, with a 
note of when the master node first reported an error and when I issued 
SIGTERM to all of the ES instances.
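
If more detail from the data nodes would help, I can turn their logging up 
past the defaults. If I'm reading the 1.3 docs right, something like this 
against any node should raise the logger levels dynamically (the specific 
logger names here are my guess):

curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient" : {
    "logger.discovery" : "TRACE",
    "logger.transport" : "TRACE"
  }
}'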

On Friday, December 5, 2014 2:39:27 PM UTC-5, Support Monkey wrote:
>
> I see. When you say the data nodes have literally nothing in their logs, 
> you mean they aren't logging anything or just nothing interesting?
>
> On Friday, December 5, 2014 7:10:13 AM UTC-8, Chris Moore wrote:
>>
>> This is running on Amazon EC2 in a VPC on dedicated instances. Physical 
>> network infrastructure is likely fine. Are there specific network issues 
>> you think we should look into?
>>
>> When we are in a problem state, we can still communicate between the nodes 
>> just fine. I can run curl requests to ES (health checks, etc.) from the 
>> master node to the data nodes directly and they return as expected. So 
>> there doesn't seem to be a socket exhaustion issue (additionally, there are 
>> no kernel errors being reported).
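>>
>> For reference, the checks I'm running from the master while we're in the 
>> bad state are roughly the following (the IPs are just our nodes, and the 
>> socket/kernel checks are whatever standard tools are handy):
>>
>> curl -s 'http://10.0.1.19:9200/_cluster/health?pretty'   # returns normally
>> curl -s 'http://10.0.1.20:9200/_cluster/health?pretty'   # also fine
>> netstat -ant | grep ':9300' | wc -l                      # connection count looks sane
>> dmesg | tail -n 50                                       # no kernel/socket errors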
>>
>> It feels like a queue or buffer is filling up somewhere, and once it has 
>> capacity again, things start working. But /_cat/thread_pool?v doesn't show 
>> anything above 0 (although, when we are in the problem state, it doesn't 
>> return a response at all if run on the master), _nodes/hot_threads doesn't 
>> show anything going on, etc.
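>>
>> Concretely, the commands are along these lines, run against a data node 
>> directly since the master doesn't answer; nothing in the output suggests a 
>> backlog:
>>
>> curl -s '10.0.1.19:9200/_cat/thread_pool?v'
>> curl -s '10.0.1.20:9200/_cat/thread_pool?v'
>> curl -s '10.0.1.19:9200/_nodes/hot_threads'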
>>
>> On Thursday, December 4, 2014 4:10:37 PM UTC-5, Support Monkey wrote:
>>>
>>> I would think the network is a prime suspect then, as there is no 
>>> significant difference between 1.2.x and 1.3.x in relation to memory usage. 
>>> And you'd certainly see OOMs in node logs if it was a memory issue.
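>>>
>>> If you want to rule memory out completely, checks along these lines on 
>>> each node should show whether the heap is anywhere close to full and 
>>> whether an OOM was ever logged (the log path assumes the standard 
>>> Debian/Ubuntu package layout):
>>>
>>> curl -s 'localhost:9200/_nodes/stats/jvm?pretty' | grep heap_used_percent
>>> grep -i OutOfMemoryError /var/log/elasticsearch/*.log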
>>>
>>> On Thursday, December 4, 2014 12:45:58 PM UTC-8, Chris Moore wrote:
>>>>
>>>> There is nothing (literally) in the log of either data node after the 
>>>> node joined events and nothing in the master log between index recovery 
>>>> and 
>>>> the first error message.
>>>>
>>>> There are 0 queries run before the errors start occurring (access to 
>>>> the nodes is blocked via a firewall, so the only communications are 
>>>> between 
>>>> the nodes). We have 50% of the RAM allocated to the heap on each node (4GB 
>>>> each).
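>>>>
>>>> For what it's worth, the heap is set the usual way on the data nodes; 
>>>> roughly this, from /etc/default/elasticsearch (paraphrasing the standard 
>>>> Ubuntu package config from memory):
>>>>
>>>> ES_HEAP_SIZE=4g    # 50% of the 8 GB on each box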
>>>>
>>>> This cluster operated without issue under 1.1.2. Did something change 
>>>> between 1.1.2 and 1.3.5 that drastically increased idle heap requirements?
>>>>
>>>>
>>>> On Thursday, December 4, 2014 3:29:23 PM UTC-5, Support Monkey wrote:
>>>>>
>>>>> Generally, ReceiveTimeoutTransportException is due to network 
>>>>> disconnects or a node failing to respond under heavy load. What does the 
>>>>> log of pYi3z5PgRh6msJX_armz_A show you? Perhaps it has too little heap 
>>>>> allocated. The rule of thumb is 1/2 of available memory, but <= 31GB.
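>>>>>
>>>>> A quick way to see how much heap that node actually has, and how full it 
>>>>> is, would be something like the following (using the node id from your 
>>>>> error; adjust if it has changed):
>>>>>
>>>>> curl -s 'localhost:9200/_nodes/pYi3z5PgRh6msJX_armz_A/stats/jvm?pretty'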
>>>>>
>>>>> On Wednesday, December 3, 2014 12:52:58 PM UTC-8, Jeff Keller wrote:
>>>>>>
>>>>>>
>>>>>> ES Version: 1.3.5
>>>>>>
>>>>>> OS: Ubuntu 14.04.1 LTS
>>>>>>
>>>>>> Machine: 2 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz, 8 GB RAM at AWS
>>>>>>
>>>>>> master (ip-10-0-1-18), 2 data nodes (ip-10-0-1-19, ip-10-0-1-20)
>>>>>>
>>>>>>
>>>>>> *After upgrading from ES 1.1.2...*
>>>>>>
>>>>>>
>>>>>> 1. Startup ES on master
>>>>>> 2. All nodes join cluster
>>>>>> 3. [2014-12-03 20:30:54,789][INFO ][gateway                  ] 
>>>>>> [ip-10-0-1-18.ec2.internal] recovered [157] indices into cluster_state
>>>>>> 4. Checked health a few times
>>>>>>
>>>>>>
>>>>>> curl -XGET localhost:9200/_cat/health?v
>>>>>>
>>>>>>
>>>>>> 5. Six minutes after cluster recovery initiates (and 5:20 after the 
>>>>>> recovery finishes), the log on the master node (10.0.1.18) reports:
>>>>>>
>>>>>>
>>>>>> [2014-12-03 20:36:57,532][DEBUG][action.admin.cluster.node.stats] [ip-10-0-1-18.ec2.internal] failed to execute on node [pYi3z5PgRh6msJX_armz_A]
>>>>>> org.elasticsearch.transport.ReceiveTimeoutTransportException: [ip-10-0-1-20.ec2.internal][inet[/10.0.1.20:9300]][cluster/nodes/stats/n] request_id [17564] timed out after [15001ms]
>>>>>>     at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:356)
>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>>>
>>>>>>
>>>>>> 6. Every 30 or 60 seconds, the above error is reported for one or more 
>>>>>> of the data nodes.
>>>>>>
>>>>>> 7. During this time, queries (search, index, etc.) don't return; they 
>>>>>> hang until the error state temporarily resolves itself (after a varying 
>>>>>> period of around 15-20 minutes), at which point the expected result is 
>>>>>> returned.
>>>>>>
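>>>>>> For example, even simple requests just sit there during this window; I 
>>>>>> have to give curl a client-side timeout to get the prompt back (the 
>>>>>> index name below is only a placeholder):
>>>>>>
>>>>>> curl -m 30 'http://10.0.1.18:9200/some_index/_search?q=*&pretty'
>>>>>> curl -m 30 -XPUT 'http://10.0.1.18:9200/some_index/doc/1' -d '{"test": 1}'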
>>>>>>

[2014-12-05 14:56:00,324][INFO ][node                     ] 
[ip-10-0-1-20.ec2.internal] version[1.3.5], pid[1061], 
build[4a50e7d/2014-11-05T15:21:28Z]
[2014-12-05 14:56:00,325][INFO ][node                     ] 
[ip-10-0-1-20.ec2.internal] initializing ...
[2014-12-05 14:56:00,378][INFO ][plugins                  ] 
[ip-10-0-1-20.ec2.internal] loaded [cloud-aws], sites []
[2014-12-05 14:56:08,974][INFO ][node                     ] 
[ip-10-0-1-20.ec2.internal] initialized
[2014-12-05 14:56:08,974][INFO ][node                     ] 
[ip-10-0-1-20.ec2.internal] starting ...
[2014-12-05 14:56:09,065][INFO ][transport                ] 
[ip-10-0-1-20.ec2.internal] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, 
publish_address {inet[/10.0.1.20:9300]}
[2014-12-05 14:56:09,074][INFO ][discovery                ] 
[ip-10-0-1-20.ec2.internal] elasticsearch-beta/fkoxZfh5S-u1AEJB486JtQ
[2014-12-05 14:56:16,639][INFO ][cluster.service          ] 
[ip-10-0-1-20.ec2.internal] detected_master 
[ip-10-0-1-18.ec2.internal][rHgrCXfgSlqtY7qKv8Lkaw][ip-10-0-1-18][inet[/10.0.1.18:9300]]{data=false,
 master=true}, added 
{[ip-10-0-1-18.ec2.internal][rHgrCXfgSlqtY7qKv8Lkaw][ip-10-0-1-18][inet[/10.0.1.18:9300]]{data=false,
 master=true},}, reason: zen-disco-receive(from master 
[[ip-10-0-1-18.ec2.internal][rHgrCXfgSlqtY7qKv8Lkaw][ip-10-0-1-18][inet[/10.0.1.18:9300]]{data=false,
 master=true}])
[2014-12-05 14:56:16,646][INFO ][http                     ] 
[ip-10-0-1-20.ec2.internal] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, 
publish_address {inet[/10.0.1.20:9200]}
[2014-12-05 14:56:16,647][INFO ][node                     ] 
[ip-10-0-1-20.ec2.internal] started
[2014-12-05 14:56:17,597][INFO ][cluster.service          ] 
[ip-10-0-1-20.ec2.internal] added 
{[ip-10-0-1-19.ec2.internal][qfYcdWRfTb-yWP11Zi6B5A][ip-10-0-1-19][inet[/10.0.1.19:9300]]{master=false},},
 reason: zen-disco-receive(from master 
[[ip-10-0-1-18.ec2.internal][rHgrCXfgSlqtY7qKv8Lkaw][ip-10-0-1-18][inet[/10.0.1.18:9300]]{data=false,
 master=true}])
******* I told all 3 servers to stop ES at this point (15:04:02). The first 
error in the master's log appeared at 15:02:26. *********
[2014-12-05 15:04:02,653][INFO ][discovery.ec2            ] 
[ip-10-0-1-20.ec2.internal] master_left 
[[ip-10-0-1-18.ec2.internal][rHgrCXfgSlqtY7qKv8Lkaw][ip-10-0-1-18][inet[/10.0.1.18:9300]]{data=false,
 master=true}], reason [transport disconnected (with verified connect)]
[2014-12-05 15:04:02,654][WARN ][discovery.ec2            ] 
[ip-10-0-1-20.ec2.internal] not enough master nodes after master left (reason = 
transport disconnected (with verified connect)), current nodes: 
{[ip-10-0-1-20.ec2.internal][fkoxZfh5S-u1AEJB486JtQ][ip-10-0-1-20][inet[/10.0.1.20:9300]]{master=false},[ip-10-0-1-19.ec2.internal][qfYcdWRfTb-yWP11Zi6B5A][ip-10-0-1-19][inet[/10.0.1.19:9300]]{master=false},}
[2014-12-05 15:04:02,657][INFO ][cluster.service          ] 
[ip-10-0-1-20.ec2.internal] removed 
{[ip-10-0-1-18.ec2.internal][rHgrCXfgSlqtY7qKv8Lkaw][ip-10-0-1-18][inet[/10.0.1.18:9300]]{data=false,
 
master=true},[ip-10-0-1-19.ec2.internal][qfYcdWRfTb-yWP11Zi6B5A][ip-10-0-1-19][inet[/10.0.1.19:9300]]{master=false},},
 reason: zen-disco-master_failed 
([ip-10-0-1-18.ec2.internal][rHgrCXfgSlqtY7qKv8Lkaw][ip-10-0-1-18][inet[/10.0.1.18:9300]]{data=false,
 master=true})
[2014-12-05 15:04:12,818][INFO ][node                     ] 
[ip-10-0-1-20.ec2.internal] stopping ...
[2014-12-05 15:04:12,827][INFO ][node                     ] 
[ip-10-0-1-20.ec2.internal] stopped
[2014-12-05 15:04:12,827][INFO ][node                     ] 
[ip-10-0-1-20.ec2.internal] closing ...
[2014-12-05 15:04:12,830][INFO ][node                     ] 
[ip-10-0-1-20.ec2.internal] closed
