We're still seeing node drops, and what is more bizarre is that we're seeing 
this on a test cluster we stood up that has no activity on it at all (no 
reads or writes going to it). Does anyone have any additional thoughts? 
Here is the configuration info and the log entries we're seeing on the 
drops.

SC-TLS1 - 4GB Memory, 1GB Heap (Master)
SC-TLS2 - 4GB Memory, 1GB Heap (Master)
SC-TLS3 - 8GB Memory, 1GB Heap (Data)
SC-TLS4 - 8GB Memory, 1GB Heap (Data)
SC-TLS5 - 8GB Memory, 1GB Heap (Data)

PX-TLS3 - 8GB Memory, 1GB Heap (Data)
PX-TLS4 - 8GB Memory, 1GB Heap (Data)
PX-TLS5 - 8GB Memory, 1GB Heap (Data)

*Elasticsearch* 1.0.1

 

*Elasticsearch Configuration Settings*

bootstrap.mlockall: true
discovery.zen.ping.timeout: 15s
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.9.84.206[9300-9400]", "10.9.84.213[9300-9400]"]
action.destructive_requires_name: true
discovery.zen.fd.ping_interval: 30s
discovery.zen.fd.ping_timeout: 120s
discovery.zen.fd.ping_retries: 10
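
For what it's worth, the effective settings (and whether mlockall actually 
succeeded) can be confirmed on every node via the nodes info API; a minimal 
sketch, assuming any node is reachable on localhost:9200:

# Effective settings as each node sees them:
curl -s 'http://localhost:9200/_nodes/settings?pretty'

# "mlockall" : true should show for every node if bootstrap.mlockall took effect:
curl -s 'http://localhost:9200/_nodes/process?pretty'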

 

*Events from TLS1 (Master)*

[2014-05-26 22:09:22,953][INFO ][cluster.service          ] [SC-TLS1] removed {[PX-TLS5][Ld8VcLgfRs2roHUWS8c6mA][PX-TLS5][inet[/10.9.64.223:9300]]{dc=PX, master=false},}, reason: zen-disco-receive(from master [[SC-TLS2][8hdMizOCRz-wufVkI-IaRw][SC-tls2][inet[/10.9.84.213:9300]]{dc=SC, data=false, master=true}])

[2014-05-26 22:12:07,085][INFO ][cluster.service          ] [SC-TLS1] added {[PX-TLS5][Ld8VcLgfRs2roHUWS8c6mA][PX-TLS5][inet[/10.9.64.223:9300]]{dc=PX, master=false},}, reason: zen-disco-receive(from master [[SC-TLS2][8hdMizOCRz-wufVkI-IaRw][SC-tls2][inet[/10.9.84.213:9300]]{dc=SC, data=false, master=true}])

 

*Events from PX-TLS5*

[2014-05-26 22:09:37,010][INFO ][discovery.zen            ] [PX-TLS5] master_left [[SC-TLS2][8hdMizOCRz-wufVkI-IaRw][SC-tls2][inet[/10.9.84.213:9300]]{dc=SC, data=false, master=true}], reason [do not exists on master, act as master failure]

[2014-05-26 22:09:37,011][INFO ][cluster.service          ] [PX-TLS5] master {new [SC-TLS1][fDW1-5P8RzWgZwGEG2BJhQ][SC-TLS1][inet[/10.9.84.206:9300]]{dc=SC, data=false, master=true}, previous [SC-TLS2][8hdMizOCRz-wufVkI-IaRw][SC-tls2][inet[/10.9.84.213:9300]]{dc=SC, data=false, master=true}}, removed {[SC-TLS2][8hdMizOCRz-wufVkI-IaRw][SC-tls2][inet[/10.9.84.213:9300]]{dc=SC, data=false, master=true},}, reason: zen-disco-master_failed ([SC-TLS2][8hdMizOCRz-wufVkI-IaRw][SC-tls2][inet[/10.9.84.213:9300]]{dc=SC, data=false, master=true})

[2014-05-26 22:10:07,035][INFO ][discovery.zen            ] [PX-TLS5] master_left [[SC-TLS1][fDW1-5P8RzWgZwGEG2BJhQ][SC-TLS1][inet[/10.9.84.206:9300]]{dc=SC, data=false, master=true}], reason [no longer master]

[2014-05-26 22:10:07,036][WARN ][discovery.zen            ] [PX-TLS5] not enough master nodes after master left (reason = no longer master), current nodes: {[PX-TLS5][Ld8VcLgfRs2roHUWS8c6mA][PX-TLS5][inet[PX-TLS5/10.9.64.223:9300]]{dc=PX, master=false},[PX-PRD-TLS3][t9ZGWrc0Qi2ASDF5te75Pw][PX-prd-tls3][inet[/10.9.64.213:9300]]{dc=PX, master=false},[SC-TLS5][NulqNMVoQiu2nu4p6w8Usg][SC-tls5][inet[/10.9.84.210:9300]]{dc=SC, master=false},[SC-TLS4][DGWDAMr9QYmN5nNjFNMyjw][SC-tls4][inet[/10.9.84.209:9300]]{dc=SC, master=false},[SC-TLS3][0QNRAMFRSgizAfWO9yxBdw][SC-tls3][inet[/10.9.84.214:9300]]{dc=SC, master=false},[PX-PRD-TLS4][4gh2_7c2RiWY9MZQCuJtjw][PX-prd-tls4][inet[/10.9.64.214:9300]]{dc=PX, master=false},}

[2014-05-26 22:10:07,037][INFO ][cluster.service          ] [PX-TLS5] removed {[SC-TLS1][fDW1-5P8RzWgZwGEG2BJhQ][SC-TLS1][inet[/10.9.84.206:9300]]{dc=SC, data=false, master=true},[PX-PRD-TLS3][t9ZGWrc0Qi2ASDF5te75Pw][PX-prd-tls3][inet[/10.9.64.213:9300]]{dc=PX, master=false},[SC-TLS5][NulqNMVoQiu2nu4p6w8Usg][SC-tls5][inet[/10.9.84.210:9300]]{dc=SC, master=false},[SC-TLS4][DGWDAMr9QYmN5nNjFNMyjw][SC-tls4][inet[/10.9.84.209:9300]]{dc=SC, master=false},[SC-TLS3][0QNRAMFRSgizAfWO9yxBdw][SC-tls3][inet[/10.9.84.214:9300]]{dc=SC, master=false},[PX-PRD-TLS4][4gh2_7c2RiWY9MZQCuJtjw][PX-prd-tls4][inet[/10.9.64.214:9300]]{dc=PX, master=false},}, reason: zen-disco-master_failed ([SC-TLS1][fDW1-5P8RzWgZwGEG2BJhQ][SC-TLS1][inet[/10.9.84.206:9300]]{dc=SC, data=false, master=true})
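
In case it helps, the drops can be watched as they happen with the cluster 
health API; a rough sketch (the host is a placeholder, any node will do):

# Poll node counts and status; number_of_nodes drops when a node is removed:
while true; do
  curl -s 'http://localhost:9200/_cluster/health?pretty' | grep -E '"status"|number_of_nodes'
  sleep 10
done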

 

On Monday, April 28, 2014 9:39:04 AM UTC-6, [email protected] wrote:
>
> So far the only log message we've seen is:
>
> zen-disco-node_failed([CDPX-PRD-ELS4][lkquUBfHT1aXAO3-_tCNCg][cdpx-prd-els4][inet[10.9.64.142/10.9.64.142:9300]]{master=false}), reason failed to ping, tried [5] times, each with maximum [1m] timeout
>
> We have other data traversing the network that would be very sensitive to 
> any latency or outages, as well as alerts that would fire if we had a 
> network outage, so I am confident we don't have any network issues when 
> this occurs.  Furthermore, we are only seeing data nodes drop; the masters 
> never drop.
>
> Is there a recommended heap size for master-only nodes?  And are there any 
> recommendations on heap size for data nodes?  I assume this could simply be 
> a timeout during GC, since our data nodes have larger heaps?
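>
> To see whether GC pauses line up with the drops, the JVM section of the 
> node stats API reports cumulative collection counts and times per 
> collector; a quick sketch, assuming the node answers on localhost:9200:
>
> # A large or fast-growing collection_time_in_millis on the "old" collector
> # around the time of a drop would point at GC pauses:
> curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty'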
>
> On Friday, April 25, 2014 5:49:44 PM UTC-6, Alexander Reelsen wrote:
>>
>> Hey,
>>
>> Is there anything in the logfile of the master node showing why it was 
>> de-elected (a network outage there as well)? Did you also give your master 
>> nodes a large heap, which could cause long pauses during GC?
>>
>>
>> --Alex
>>
>>
>> On Mon, Apr 21, 2014 at 5:51 PM, <[email protected]> wrote:
>>
>>> We currently are running dedicated master nodes but I believe they are 
>>> also servicing queries.  I can change it such that queries only hit the 
>>> data nodes and see if that eliminates the issue...
>>>
>>> On Monday, April 21, 2014 3:40:59 PM UTC-6, Binh Ly wrote:
>>>>
>>>> Other than network, is it possible that your nodes could sometimes be 
>>>> overloaded such that they cannot respond immediately? If that's the case, 
>>>> then you can probably get 3 nodes (servers) and make them master-only 
>>>> nodes (node.master: true, node.data: false). Set 
>>>> discovery.zen.minimum_master_nodes: 2 for those 3 nodes. And then make 
>>>> the rest of your data nodes non-master-eligible (node.master: false, 
>>>> node.data: true). This way you have 3 nodes dedicated only to cluster 
>>>> state/master tasks, unimpeded by load or anything else other than your 
>>>> network. Just don't run anything else on them or send queries/indexing 
>>>> jobs to these 3 nodes. :)
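>>>>
>>>> A minimal sketch of what that could look like (the host and exact values 
>>>> are assumptions, adjust to your setup):
>>>>
>>>> # elasticsearch.yml on each of the 3 dedicated master nodes:
>>>> #   node.master: true
>>>> #   node.data: false
>>>> #   discovery.zen.minimum_master_nodes: 2
>>>> #
>>>> # If I remember right, minimum_master_nodes is dynamically updatable in
>>>> # 1.x, so it can also be applied to a running cluster without a restart:
>>>> curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
>>>>   "persistent" : { "discovery.zen.minimum_master_nodes" : 2 }
>>>> }'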
>>>>
