On Thu, Nov 6, 2014 at 9:00 PM, Gayashan Amarasinghe <[email protected]>
wrote:

> Hi Kishanthan, all,
>
> This was a tricky situation, but I was able to identify the issue and fix
> it. It was caused by the new Hazelcast upgrade.
>
> HazelcastGroupManagementAgent maintains two collections of members: a
> Hazelcast distributed map shared by the whole cluster, which holds all the
> members in the cluster (the members map), and a connected members list,
> which is maintained per subdomain. When a member leaves the cluster, a
> MemberEntryListener and a GroupMembershipListener (and some other
> listeners) get notified. The MemberEntryListener is notified whenever the
> members map changes. When a member leaves, the entryRemoved method of this
> listener also removes that member from the connectedMembers list. The
> event it receives (EntryEvent) carries the member that left. In the
> current implementation the member is obtained from the EntryEvent as
> follows:
>
> entryEvent.getValue()
>
> So in the code we do this,
>
> connectedMembers.remove(entryEvent.getValue());
>
> In the previous Hazelcast version this returned the correct member.
> However, with the new Hazelcast version it returns null, so the connected
> members list is not updated properly. This is caused by a fix in
> Hazelcast [1] [2].
>
> The TenantAwareLoadBalanceEndpoint in the ELB uses this connected members
> list to pick the next application member to serve the incoming request.
> This is why the ELB kept trying to send requests to disconnected members
> and eventually became non-responsive.
>
> As a fix, I have identified that we can use
>
> entryEvent.getOldValue()
>
> to acquire the member that just left (Hazelcast issue [1] also suggests
> using it).
>
> WDYT?
>

+1, looks like they have fixed the implementation properly and we should
use the above for the member removed event. Good find :)
Also, I believe this only affects the member removed event type and we don't
have to change anything for member added events?
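
To make the change concrete, here is a minimal, self-contained sketch of
the two removal paths. It does not use the real Hazelcast classes:
RemovedEvent is a hypothetical stand-in for EntryEvent that only models the
behaviour described above (getValue() is null on removal, getOldValue()
holds the removed member), and plain strings stand in for members. It is
not the actual HazelcastGroupManagementAgent code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MemberRemovalSketch {

    /** Hypothetical stand-in for Hazelcast's EntryEvent on an entry removal. */
    static final class RemovedEvent<V> {
        private final V oldValue;

        RemovedEvent(V oldValue) {
            this.oldValue = oldValue;
        }

        // Models the behaviour reported with the upgraded Hazelcast:
        // a removal event carries no current value.
        V getValue() {
            return null;
        }

        // The value that was in the map before the entry was removed.
        V getOldValue() {
            return oldValue;
        }
    }

    // Old code path: tries to remove null, so the list is left untouched.
    static boolean removeUsingValue(List<String> connectedMembers,
                                    RemovedEvent<String> event) {
        return connectedMembers.remove(event.getValue());
    }

    // Proposed code path: removes the member that actually left.
    static boolean removeUsingOldValue(List<String> connectedMembers,
                                       RemovedEvent<String> event) {
        return connectedMembers.remove(event.getOldValue());
    }

    public static void main(String[] args) {
        // Placeholder member identifiers, taken from the hosts in the log.
        List<String> connectedMembers = new ArrayList<>(
                Arrays.asList("172.31.7.214:4100", "172.31.0.128:4100"));
        RemovedEvent<String> event = new RemovedEvent<>("172.31.7.214:4100");

        System.out.println("getValue removed a member: "
                + removeUsingValue(connectedMembers, event));
        System.out.println("getOldValue removed a member: "
                + removeUsingOldValue(connectedMembers, event));
        System.out.println("connected members now: " + connectedMembers);
    }
}
```

With the first path the departed member stays in the list, which matches
the ELB repeatedly failing over to the same dead host in the log.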


>
> I have created the JIRA [3] for this issue and will send the PR with the
> fix.
>
> [1] https://github.com/hazelcast/hazelcast/issues/3198
> [2] https://github.com/hazelcast/hazelcast/issues/3859
> [3] https://wso2.org/jira/browse/CARBON-15057
>
> Thanks.
> /Gayashan
>
> On Wed, Nov 5, 2014 at 5:47 PM, Kishanthan Thangarajah <
> [email protected]> wrote:
>
>> Gayashan, please share your latest findings on this.
>>
>> When we see the member left message, the current member list is updated
>> with that event (the member gets removed). So the above can occur if that
>> is not happening properly. We should also compare the behaviour with and
>> without the Hazelcast upgrade.
>>
>> On Fri, Oct 31, 2014 at 5:30 PM, Gayashan Amarasinghe <[email protected]>
>> wrote:
>>
>>> Hi all,
>>>
>>> For Carbon testing we have a worker-mgt cluster fronted by an ELB, with
>>> requests continuously coming in from a JMeter client. If one (or more) of
>>> the worker nodes is shut down during this, after some time the ELB stops
>>> sending requests to the nodes and the connections time out. The following
>>> log gets printed in the ELB.
>>>
>>> TID: [0] [ELB] [2014-10-31 06:27:32,517]  INFO
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>>> Failed to send message to Member Host:172.31.7.214, Remote Host:null, Port:
>>> 4100, HTTP:9765, HTTPS:9445, Domain: wso2.as.domain, Sub-domain:worker,
>>> Active:true . Error Code: 101503
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>>> TID: [0] [ELB] [2014-10-31 06:27:32,519]  INFO
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>>> Dropping the faulty/unreachable Member with Domain:wso2.as.domain,
>>> Host:172.31.7.214, Port:4100
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>>> TID: [0] [ELB] [2014-10-31 06:27:32,738]  INFO
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>>> Failed over to Host:172.31.0.128, Remote Host:null, Port: 4100, HTTP:9763,
>>> HTTPS:9443, Domain: wso2.as.domain, Sub-domain:worker, Active:true
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>>> TID: [0] [ELB] [2014-10-31 06:27:32,740]  WARN
>>> {org.apache.synapse.transport.passthru.ConnectCallback} -  Connection
>>> refused or failed for : /172.31.7.214:9765
>>> {org.apache.synapse.transport.passthru.ConnectCallback}
>>> TID: [0] [ELB] [2014-10-31 06:27:32,743]  INFO
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>>> Failed to send message to Member Host:172.31.7.214, Remote Host:null, Port:
>>> 4100, HTTP:9765, HTTPS:9445, Domain: wso2.as.domain, Sub-domain:worker,
>>> Active:true . Error Code: 101503
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>>> TID: [0] [ELB] [2014-10-31 06:27:32,745]  INFO
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>>> Dropping the faulty/unreachable Member with Domain:wso2.as.domain,
>>> Host:172.31.7.214, Port:4100
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>>> TID: [0] [ELB] [2014-10-31 06:27:33,518]  INFO
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>>> Failed over to Host:172.31.7.214, Remote Host:null, Port: 4100, HTTP:9765,
>>> HTTPS:9445, Domain: wso2.as.domain, Sub-domain:worker, Active:true
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>>> TID: [0] [ELB] [2014-10-31 06:27:33,520]  WARN
>>> {org.apache.synapse.transport.passthru.ConnectCallback} -  Connection
>>> refused or failed for : /172.31.7.214:9765
>>> {org.apache.synapse.transport.passthru.ConnectCallback}
>>> TID: [0] [ELB] [2014-10-31 06:27:33,523]  WARN
>>> {org.apache.synapse.transport.passthru.ConnectCallback} -  Connection
>>> refused or failed for : /172.31.7.214:9765
>>> {org.apache.synapse.transport.passthru.ConnectCallback}
>>> TID: [0] [ELB] [2014-10-31 06:27:33,744]  INFO
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>>> Failed over to Host:172.31.7.214, Remote Host:null, Port: 4100, HTTP:9765,
>>> HTTPS:9445, Domain: wso2.as.domain, Sub-domain:worker, Active:true
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>>> TID: [0] [ELB] [2014-10-31 06:27:33,745]  INFO
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>>> Failed to send message to Member Host:172.31.7.214, Remote Host:null, Port:
>>> 4100, HTTP:9765, HTTPS:9445, Domain: wso2.as.domain, Sub-domain:worker,
>>> Active:true . Error Code: 101503
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>>> TID: [0] [ELB] [2014-10-31 06:27:33,745]  WARN
>>> {org.apache.synapse.transport.passthru.ConnectCallback} -  Connection
>>> refused or failed for : /172.31.7.214:9765
>>> {org.apache.synapse.transport.passthru.ConnectCallback}
>>> TID: [0] [ELB] [2014-10-31 06:27:33,747]  INFO
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>>> Dropping the faulty/unreachable Member with Domain:wso2.as.domain,
>>> Host:172.31.7.214, Port:4100
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>>> TID: [0] [ELB] [2014-10-31 06:27:34,746]  INFO
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>>> Failed over to Host:172.31.0.128, Remote Host:null, Port: 4100, HTTP:9763,
>>> HTTPS:9443, Domain: wso2.as.domain, Sub-domain:worker, Active:true
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>>> TID: [0] [ELB] [2014-10-31 06:27:34,748]  INFO
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>>> Failed to send message to Member Host:172.31.7.214, Remote Host:null, Port:
>>> 4100, HTTP:9765, HTTPS:9445, Domain: wso2.as.domain, Sub-domain:worker,
>>> Active:true . Error Code: 101503
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>>> TID: [0] [ELB] [2014-10-31 06:27:34,750]  INFO
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>>> Dropping the faulty/unreachable Member with Domain:wso2.as.domain,
>>> Host:172.31.7.214, Port:4100
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>>> TID: [0] [ELB] [2014-10-31 06:27:35,749]  INFO
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>>> Failed over to Host:172.31.7.214, Remote Host:null, Port: 4100, HTTP:9765,
>>> HTTPS:9445, Domain: wso2.as.domain, Sub-domain:worker, Active:true
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>>> TID: [0] [ELB] [2014-10-31 06:27:35,750]  WARN
>>> {org.apache.synapse.transport.passthru.ConnectCallback} -  Connection
>>> refused or failed for : /172.31.7.214:9765
>>> {org.apache.synapse.transport.passthru.ConnectCallback}
>>> TID: [0] [ELB] [2014-10-31 06:28:30,604]  WARN
>>> {org.apache.synapse.transport.passthru.SourceHandler} -  Connection time
>>> out after request is read: http-incoming-61
>>> {org.apache.synapse.transport.passthru.SourceHandler}
>>> TID: [0] [ELB] [2014-10-31 06:28:32,606]  WARN
>>> {org.apache.synapse.transport.passthru.SourceHandler} -  Connection time
>>> out after request is read: http-incoming-65
>>> {org.apache.synapse.transport.passthru.SourceHandler}
>>> TID: [0] [ELB] [2014-10-31 06:28:33,608]  WARN
>>> {org.apache.synapse.transport.passthru.SourceHandler} -  Connection time
>>> out after request is read: http-incoming-73
>>> {org.apache.synapse.transport.passthru.SourceHandler}
>>> TID: [0] [ELB] [2014-10-31 06:28:33,608]  WARN
>>> {org.apache.synapse.transport.passthru.SourceHandler} -  Connection time
>>> out after request is read: http-incoming-69
>>> {org.apache.synapse.transport.passthru.SourceHandler}
>>> TID: [0] [ELB] [2014-10-31 06:28:33,608]  WARN
>>> {org.apache.synapse.transport.passthru.SourceHandler} -  Connection time
>>> out after request is read: http-incoming-64
>>> {org.apache.synapse.transport.passthru.SourceHandler}
>>> TID: [0] [ELB] [2014-10-31 06:28:33,609]  WARN
>>> {org.apache.synapse.transport.passthru.SourceHandler} -  Connection time
>>> out after request is read: http-incoming-75
>>> {org.apache.synapse.transport.passthru.SourceHandler}
>>> TID: [0] [ELB] [2014-10-31 06:28:34,610]  WARN
>>> {org.apache.synapse.transport.passthru.SourceHandler} -  Connection time
>>> out after request is read: http-incoming-60
>>> {org.apache.synapse.transport.passthru.SourceHandler}
>>> TID: [0] [ELB] [2014-10-31 06:28:34,611]  WARN
>>> {org.apache.synapse.transport.passthru.SourceHandler} -  Connection time
>>> out after request is read: http-incoming-62
>>> {org.apache.synapse.transport.passthru.SourceHandler}
>>> TID: [0] [ELB] [2014-10-31 06:28:34,611]  WARN
>>> {org.apache.synapse.transport.passthru.SourceHandler} -  Connection time
>>> out after request is read: http-incoming-67
>>> {org.apache.synapse.transport.passthru.SourceHandler}
>>>
>>> We need to restart the ELB to recover from this. Any idea what's going
>>> on? Is this a known issue? I can provide the full log if needed.
>>>
>>> Thanks.
>>> /Gayashan
>>>
>>> --
>>> *Gayashan Amarasinghe*
>>> Software Engineer | Platform TG
>>> WSO2, Inc. | http://wso2.com
>>> lean. enterprise. middleware
>>>
>>> Mobile : +94718314517
>>> Blog : gayashan-a.blogspot.com
>>>
>>
>>
>>
>> --
>> *Kishanthan Thangarajah*
>> Senior Software Engineer,
>> Platform Technologies Team,
>> WSO2, Inc.
>> lean.enterprise.middleware
>>
>> Mobile - +94773426635
>> Blog - *http://kishanthan.wordpress.com
>> <http://kishanthan.wordpress.com>*
>> Twitter - *http://twitter.com/kishanthan <http://twitter.com/kishanthan>*
>>
>
>
>
> --
> *Gayashan Amarasinghe*
> Software Engineer | Platform TG
> WSO2, Inc. | http://wso2.com
> lean. enterprise. middleware
>
> Mobile : +94718314517
> Blog : gayashan-a.blogspot.com
>



-- 
*Kishanthan Thangarajah*
Senior Software Engineer,
Platform Technologies Team,
WSO2, Inc.
lean.enterprise.middleware

Mobile - +94773426635
Blog - *http://kishanthan.wordpress.com <http://kishanthan.wordpress.com>*
Twitter - *http://twitter.com/kishanthan <http://twitter.com/kishanthan>*
_______________________________________________
Dev mailing list
[email protected]
http://wso2.org/cgi-bin/mailman/listinfo/dev
