Hi Kishanthan, all,

This was a tricky one, but I was able to identify the issue and fix it. It
was caused by the recent Hazelcast upgrade.

HazelcastGroupManagementAgent maintains two collections of members: a
Hazelcast distributed map shared by the whole cluster, which holds all the
members in the cluster (the members map), and a connected members list,
which is maintained per subdomain. When a member leaves the cluster, a
MemberEntryListener and a GroupMembershipListener (and some other
listeners) get notified. The MemberEntryListener is notified whenever the
members map changes. When a member leaves, the entryRemoved method of this
listener also removes that member from the connectedMembers list. The event
it receives (EntryEvent) carries the member that left, and the current
implementation reads the member from the EntryEvent as follows:

entryEvent.getValue()

So in the code we do this,

connectedMembers.remove(entryEvent.getValue());

With the previous Hazelcast version this returned the correct member. With
the new Hazelcast version it returns null, so the connected members list is
not updated properly. This is caused by a fix in Hazelcast [1] [2].
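
To make the failure mode concrete, here is a minimal sketch that does not
use Hazelcast at all. RemovedEventStandIn is a hypothetical stand-in for
the EntryEvent delivered on removal, written under the assumption described
above (getValue() is null after the upgrade), and plain strings stand in
for the member objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the removal event, NOT the Hazelcast class.
// Assumption: after the upgrade getValue() is null on removal and only
// getOldValue() carries the entry that was removed.
final class RemovedEventStandIn<V> {
    private final V oldValue;

    RemovedEventStandIn(V oldValue) {
        this.oldValue = oldValue;
    }

    V getValue() {
        return null;
    }

    V getOldValue() {
        return oldValue;
    }
}

final class StaleMemberDemo {
    // Mirrors the current entryRemoved logic: remove whatever getValue() returns.
    static List<String> removeUsingGetValue(List<String> connectedMembers,
                                            RemovedEventStandIn<String> event) {
        List<String> result = new ArrayList<String>(connectedMembers);
        // remove(null) matches no element, so the list is left unchanged.
        result.remove(event.getValue());
        return result;
    }

    public static void main(String[] args) {
        List<String> members = Arrays.asList("172.31.0.128", "172.31.7.214");
        RemovedEventStandIn<String> left =
                new RemovedEventStandIn<String>("172.31.7.214");
        System.out.println(removeUsingGetValue(members, left));
    }
}
```

The member that left is still in the list afterwards, and no exception is
thrown, which is probably why this went unnoticed until the ELB started
failing.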

The TenantAwareLoadBalanceEndpoint in the ELB uses this connected members
list to pick the next application member to serve an incoming request.
Because the list still contained the members that had left, the ELB kept
trying to send requests to disconnected members and eventually became
non-responsive.

As a fix, I have found that we can use

entryEvent.getOldValue()

to acquire the member that just left (Hazelcast issue [1] also suggests
using it).
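
A rough sketch of what the changed removal could look like, again without
Hazelcast on the classpath. EntryRemovedStandIn and OldValueFixSketch are
hypothetical names used only for illustration, and the null guard is a
defensive extra of mine rather than something the Hazelcast issue asks for.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the removal event, NOT the Hazelcast class.
final class EntryRemovedStandIn<V> {
    private final V value;
    private final V oldValue;

    EntryRemovedStandIn(V value, V oldValue) {
        this.value = value;
        this.oldValue = oldValue;
    }

    V getValue() {
        return value;
    }

    V getOldValue() {
        return oldValue;
    }
}

final class OldValueFixSketch {
    // Removes the departed member using getOldValue() instead of getValue().
    // Returns true only if a member was actually removed from the list.
    static <V> boolean removeDeparted(List<V> connectedMembers,
                                      EntryRemovedStandIn<V> event) {
        V departed = event.getOldValue();
        if (departed == null) {
            // Nothing to remove; the caller could log this case.
            return false;
        }
        return connectedMembers.remove(departed);
    }

    public static void main(String[] args) {
        List<String> members = new ArrayList<String>(
                Arrays.asList("172.31.0.128", "172.31.7.214"));
        // Event shaped like the post-upgrade case: value is null, old value is set.
        boolean removed = removeDeparted(members,
                new EntryRemovedStandIn<String>(null, "172.31.7.214"));
        System.out.println(removed + " " + members);
    }
}
```

Returning a boolean makes it easy to log a warning if the member was not in
the list, so a silent mismatch like this one would show up earlier.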

WDYT?

I have created the JIRA [3] for this issue and will send the PR with the
fix.

[1] https://github.com/hazelcast/hazelcast/issues/3198
[2] https://github.com/hazelcast/hazelcast/issues/3859
[3] https://wso2.org/jira/browse/CARBON-15057

Thanks.
/Gayashan

On Wed, Nov 5, 2014 at 5:47 PM, Kishanthan Thangarajah <[email protected]>
wrote:

> Gayashan, please share your latest findings on this.
>
> When we see the member left msg, the current member list is updated with
> that event (the member gets removed). So above can occur if that is not
> happening accordingly. We should also compare the same with and without
> hazelcast upgrade.
>
> On Fri, Oct 31, 2014 at 5:30 PM, Gayashan Amarasinghe <[email protected]>
> wrote:
>
>> Hi all,
>>
>> For Carbon testing we have a worker-mgt cluster fronted by an ELB, and
>> requests keep coming in from a JMeter client. If one (or more) of the
>> worker nodes is shut down during this, after some time the ELB stops
>> sending requests to the nodes and the connections time out. The following
>> log gets printed in the ELB.
>>
>> TID: [0] [ELB] [2014-10-31 06:27:32,517]  INFO
>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>> Failed to send message to Member Host:172.31.7.214, Remote Host:null, Port:
>> 4100, HTTP:9765, HTTPS:9445, Domain: wso2.as.domain, Sub-domain:worker,
>> Active:true . Error Code: 101503
>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>> TID: [0] [ELB] [2014-10-31 06:27:32,519]  INFO
>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>> Dropping the faulty/unreachable Member with Domain:wso2.as.domain,
>> Host:172.31.7.214, Port:4100
>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>> TID: [0] [ELB] [2014-10-31 06:27:32,738]  INFO
>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>> Failed over to Host:172.31.0.128, Remote Host:null, Port: 4100, HTTP:9763,
>> HTTPS:9443, Domain: wso2.as.domain, Sub-domain:worker, Active:true
>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>> TID: [0] [ELB] [2014-10-31 06:27:32,740]  WARN
>> {org.apache.synapse.transport.passthru.ConnectCallback} -  Connection
>> refused or failed for : /172.31.7.214:9765
>> {org.apache.synapse.transport.passthru.ConnectCallback}
>> TID: [0] [ELB] [2014-10-31 06:27:32,743]  INFO
>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>> Failed to send message to Member Host:172.31.7.214, Remote Host:null, Port:
>> 4100, HTTP:9765, HTTPS:9445, Domain: wso2.as.domain, Sub-domain:worker,
>> Active:true . Error Code: 101503
>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>> TID: [0] [ELB] [2014-10-31 06:27:32,745]  INFO
>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>> Dropping the faulty/unreachable Member with Domain:wso2.as.domain,
>> Host:172.31.7.214, Port:4100
>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>> TID: [0] [ELB] [2014-10-31 06:27:33,518]  INFO
>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>> Failed over to Host:172.31.7.214, Remote Host:null, Port: 4100, HTTP:9765,
>> HTTPS:9445, Domain: wso2.as.domain, Sub-domain:worker, Active:true
>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>> TID: [0] [ELB] [2014-10-31 06:27:33,520]  WARN
>> {org.apache.synapse.transport.passthru.ConnectCallback} -  Connection
>> refused or failed for : /172.31.7.214:9765
>> {org.apache.synapse.transport.passthru.ConnectCallback}
>> TID: [0] [ELB] [2014-10-31 06:27:33,523]  WARN
>> {org.apache.synapse.transport.passthru.ConnectCallback} -  Connection
>> refused or failed for : /172.31.7.214:9765
>> {org.apache.synapse.transport.passthru.ConnectCallback}
>> TID: [0] [ELB] [2014-10-31 06:27:33,744]  INFO
>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>> Failed over to Host:172.31.7.214, Remote Host:null, Port: 4100, HTTP:9765,
>> HTTPS:9445, Domain: wso2.as.domain, Sub-domain:worker, Active:true
>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>> TID: [0] [ELB] [2014-10-31 06:27:33,745]  INFO
>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>> Failed to send message to Member Host:172.31.7.214, Remote Host:null, Port:
>> 4100, HTTP:9765, HTTPS:9445, Domain: wso2.as.domain, Sub-domain:worker,
>> Active:true . Error Code: 101503
>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>> TID: [0] [ELB] [2014-10-31 06:27:33,745]  WARN
>> {org.apache.synapse.transport.passthru.ConnectCallback} -  Connection
>> refused or failed for : /172.31.7.214:9765
>> {org.apache.synapse.transport.passthru.ConnectCallback}
>> TID: [0] [ELB] [2014-10-31 06:27:33,747]  INFO
>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>> Dropping the faulty/unreachable Member with Domain:wso2.as.domain,
>> Host:172.31.7.214, Port:4100
>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>> TID: [0] [ELB] [2014-10-31 06:27:34,746]  INFO
>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>> Failed over to Host:172.31.0.128, Remote Host:null, Port: 4100, HTTP:9763,
>> HTTPS:9443, Domain: wso2.as.domain, Sub-domain:worker, Active:true
>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>> TID: [0] [ELB] [2014-10-31 06:27:34,748]  INFO
>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>> Failed to send message to Member Host:172.31.7.214, Remote Host:null, Port:
>> 4100, HTTP:9765, HTTPS:9445, Domain: wso2.as.domain, Sub-domain:worker,
>> Active:true . Error Code: 101503
>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>> TID: [0] [ELB] [2014-10-31 06:27:34,750]  INFO
>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>> Dropping the faulty/unreachable Member with Domain:wso2.as.domain,
>> Host:172.31.7.214, Port:4100
>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>> TID: [0] [ELB] [2014-10-31 06:27:35,749]  INFO
>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>> Failed over to Host:172.31.7.214, Remote Host:null, Port: 4100, HTTP:9765,
>> HTTPS:9445, Domain: wso2.as.domain, Sub-domain:worker, Active:true
>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>> TID: [0] [ELB] [2014-10-31 06:27:35,750]  WARN
>> {org.apache.synapse.transport.passthru.ConnectCallback} -  Connection
>> refused or failed for : /172.31.7.214:9765
>> {org.apache.synapse.transport.passthru.ConnectCallback}
>> TID: [0] [ELB] [2014-10-31 06:28:30,604]  WARN
>> {org.apache.synapse.transport.passthru.SourceHandler} -  Connection time
>> out after request is read: http-incoming-61
>> {org.apache.synapse.transport.passthru.SourceHandler}
>> TID: [0] [ELB] [2014-10-31 06:28:32,606]  WARN
>> {org.apache.synapse.transport.passthru.SourceHandler} -  Connection time
>> out after request is read: http-incoming-65
>> {org.apache.synapse.transport.passthru.SourceHandler}
>> TID: [0] [ELB] [2014-10-31 06:28:33,608]  WARN
>> {org.apache.synapse.transport.passthru.SourceHandler} -  Connection time
>> out after request is read: http-incoming-73
>> {org.apache.synapse.transport.passthru.SourceHandler}
>> TID: [0] [ELB] [2014-10-31 06:28:33,608]  WARN
>> {org.apache.synapse.transport.passthru.SourceHandler} -  Connection time
>> out after request is read: http-incoming-69
>> {org.apache.synapse.transport.passthru.SourceHandler}
>> TID: [0] [ELB] [2014-10-31 06:28:33,608]  WARN
>> {org.apache.synapse.transport.passthru.SourceHandler} -  Connection time
>> out after request is read: http-incoming-64
>> {org.apache.synapse.transport.passthru.SourceHandler}
>> TID: [0] [ELB] [2014-10-31 06:28:33,609]  WARN
>> {org.apache.synapse.transport.passthru.SourceHandler} -  Connection time
>> out after request is read: http-incoming-75
>> {org.apache.synapse.transport.passthru.SourceHandler}
>> TID: [0] [ELB] [2014-10-31 06:28:34,610]  WARN
>> {org.apache.synapse.transport.passthru.SourceHandler} -  Connection time
>> out after request is read: http-incoming-60
>> {org.apache.synapse.transport.passthru.SourceHandler}
>> TID: [0] [ELB] [2014-10-31 06:28:34,611]  WARN
>> {org.apache.synapse.transport.passthru.SourceHandler} -  Connection time
>> out after request is read: http-incoming-62
>> {org.apache.synapse.transport.passthru.SourceHandler}
>> TID: [0] [ELB] [2014-10-31 06:28:34,611]  WARN
>> {org.apache.synapse.transport.passthru.SourceHandler} -  Connection time
>> out after request is read: http-incoming-67
>> {org.apache.synapse.transport.passthru.SourceHandler}
>>
>> Need to restart the ELB to recover from this. Any idea what's going on?
>> Is this a known issue? Can provide the full log if needed.
>>
>> Thanks.
>> /Gayashan
>>
>> --
>> *Gayashan Amarasinghe*
>> Software Engineer | Platform TG
>> WSO2, Inc. | http://wso2.com
>> lean. enterprise. middleware
>>
>> Mobile : +94718314517
>> Blog : gayashan-a.blogspot.com
>>
>
>
>
> --
> *Kishanthan Thangarajah*
> Senior Software Engineer,
> Platform Technologies Team,
> WSO2, Inc.
> lean.enterprise.middleware
>
> Mobile - +94773426635
> Blog - *http://kishanthan.wordpress.com <http://kishanthan.wordpress.com>*
> Twitter - *http://twitter.com/kishanthan <http://twitter.com/kishanthan>*
>



-- 
*Gayashan Amarasinghe*
Software Engineer | Platform TG
WSO2, Inc. | http://wso2.com
lean. enterprise. middleware

Mobile : +94718314517
Blog : gayashan-a.blogspot.com
_______________________________________________
Dev mailing list
[email protected]
http://wso2.org/cgi-bin/mailman/listinfo/dev
