On Thu, Nov 6, 2014 at 9:00 PM, Gayashan Amarasinghe <[email protected]> wrote:
> Hi Kishanthan, all,
>
> This was a tricky situation, and I was able to identify the issue and fix
> it. It was caused by the new hazelcast upgrade.
>
> There are two lists of members maintained in the
> HazelcastGroupManagementAgent: a hazelcast distributed map shared by the
> cluster, which consists of all the members in the cluster (the members
> map), and a connected members list, which is maintained for each subdomain
> in the cluster. When a member leaves the cluster, a MemberEntryListener and
> a GroupMembershipListener (and some other listeners) get notified. The
> MemberEntryListener gets notified when the members map changes. When a
> member leaves, the entryRemoved method of this listener also removes the
> member that just left from the connectedMembers list. The event that it
> receives (EntryEvent) carries the member that left. In the current
> implementation this member is acquired from the EntryEvent as follows,
>
> entryEvent.getValue()
>
> So in the code we do this,
>
> connectedMembers.remove(entryEvent.getValue());
>
> In the previous hazelcast version this returned the correct member.
> However, with the new hazelcast version it returns null, so the
> connected members list does not get updated properly. This is
> caused by a fix in hazelcast [1] [2].
>
> The TenantAwareLoadBalanceEndpoint in the ELB uses this connected members
> list to get the next application member to serve the incoming request. This
> is why the ELB kept trying to send requests to disconnected members and
> eventually became non-responsive.
>
> As a fix, I have identified that we can use
>
> entryEvent.getOldValue()
>
> to acquire the member that just left (hazelcast issue [1] also suggests
> using it).
>
> WDYT?
>

+1, it looks like they have fixed the implementation properly, and we should
use the above for the member removed event.
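For illustration, the change agreed above could look roughly like the sketch
below. It is not the actual HazelcastGroupManagementAgent code: the nested
EntryEvent class is a simplified, hypothetical stand-in for Hazelcast's
EntryEvent (only its accessors are modelled), members are plain strings, and
the names MemberRemovalSketch and entryRemovedUsingValue are made up for the
example. It assumes, as described in this thread, that a removal event
carries null as its value and the departed member as its old value.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class MemberRemovalSketch {

    /** Simplified, hypothetical stand-in for Hazelcast's EntryEvent. */
    static final class EntryEvent<K, V> {
        private final K key;
        private final V oldValue;
        private final V value;

        EntryEvent(K key, V oldValue, V value) {
            this.key = key;
            this.oldValue = oldValue;
            this.value = value;
        }

        K getKey() { return key; }
        V getOldValue() { return oldValue; }
        V getValue() { return value; }
    }

    /** Keeps a connected members list in sync with removals from the members map. */
    static final class MemberEntryListener {
        private final List<String> connectedMembers;

        MemberEntryListener(List<String> connectedMembers) {
            this.connectedMembers = connectedMembers;
        }

        /** Behaviour before the fix: getValue() is null on removal, so nothing is removed. */
        void entryRemovedUsingValue(EntryEvent<String, String> event) {
            connectedMembers.remove(event.getValue());
        }

        /** Behaviour after the fix: the departed member is read from getOldValue(). */
        void entryRemoved(EntryEvent<String, String> event) {
            String departed = event.getOldValue();
            if (departed != null) {
                connectedMembers.remove(departed);
            }
        }
    }

    public static void main(String[] args) {
        List<String> connected = new CopyOnWriteArrayList<>(
                Arrays.asList("172.31.7.214:4100", "172.31.0.128:4100"));
        MemberEntryListener listener = new MemberEntryListener(connected);

        // Removal event as this thread describes it for the newer Hazelcast:
        // the value is null and the old value is the member that left.
        // The key "member-1" is a made-up identifier.
        EntryEvent<String, String> left =
                new EntryEvent<>("member-1", "172.31.7.214:4100", null);

        listener.entryRemovedUsingValue(left);
        System.out.println("using getValue():    " + connected);

        listener.entryRemoved(left);
        System.out.println("using getOldValue(): " + connected);
    }
}
```

In this sketch the first call leaves both members in the list, which
corresponds to the stale entry the ELB kept failing over to, while the second
call leaves only 172.31.0.128:4100.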
Good findings :) Also, I believe this only affects the member removed event
type, and we don't have to change anything for member added events?

>
> I have created the JIRA [3] for this issue and will send the PR with the
> fix.
>
> [1] https://github.com/hazelcast/hazelcast/issues/3198
> [2] https://github.com/hazelcast/hazelcast/issues/3859
> [3] https://wso2.org/jira/browse/CARBON-15057
>
> Thanks.
> /Gayashan
>
> On Wed, Nov 5, 2014 at 5:47 PM, Kishanthan Thangarajah <[email protected]> wrote:
>
>> Gayashan, please share your latest findings on this.
>>
>> When we see the member left msg, the current member list is updated with
>> that event (the member gets removed). So the above can occur if that is
>> not happening as expected. We should also compare the behaviour with and
>> without the hazelcast upgrade.
>>
>> On Fri, Oct 31, 2014 at 5:30 PM, Gayashan Amarasinghe <[email protected]>
>> wrote:
>>
>>> Hi all,
>>>
>>> For Carbon testing we have a worker-mgt cluster fronted by ELB, and
>>> requests keep coming in from a jmeter client. If one (or more) of the
>>> worker nodes is shut down during this, after some time the ELB stops
>>> sending requests to the nodes and the connection times out. The
>>> following log gets printed in the ELB.
>>>
>>> TID: [0] [ELB] [2014-10-31 06:27:32,517] INFO
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>>> Failed to send message to Member Host:172.31.7.214, Remote Host:null, Port:
>>> 4100, HTTP:9765, HTTPS:9445, Domain: wso2.as.domain, Sub-domain:worker,
>>> Active:true .
>>> Error Code: 101503
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>>> TID: [0] [ELB] [2014-10-31 06:27:32,519] INFO
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>>> Dropping the faulty/unreachable Member with Domain:wso2.as.domain,
>>> Host:172.31.7.214, Port:4100
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>>> TID: [0] [ELB] [2014-10-31 06:27:32,738] INFO
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>>> Failed over to Host:172.31.0.128, Remote Host:null, Port: 4100, HTTP:9763,
>>> HTTPS:9443, Domain: wso2.as.domain, Sub-domain:worker, Active:true
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>>> TID: [0] [ELB] [2014-10-31 06:27:32,740] WARN
>>> {org.apache.synapse.transport.passthru.ConnectCallback} - Connection
>>> refused or failed for : /172.31.7.214:9765
>>> {org.apache.synapse.transport.passthru.ConnectCallback}
>>> TID: [0] [ELB] [2014-10-31 06:27:32,743] INFO
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>>> Failed to send message to Member Host:172.31.7.214, Remote Host:null, Port:
>>> 4100, HTTP:9765, HTTPS:9445, Domain: wso2.as.domain, Sub-domain:worker,
>>> Active:true . Error Code: 101503
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>>> TID: [0] [ELB] [2014-10-31 06:27:32,745] INFO
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>>> Dropping the faulty/unreachable Member with Domain:wso2.as.domain,
>>> Host:172.31.7.214, Port:4100
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>>> TID: [0] [ELB] [2014-10-31 06:27:33,518] INFO
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>>> Failed over to Host:172.31.7.214, Remote Host:null, Port: 4100, HTTP:9765,
>>> HTTPS:9445, Domain: wso2.as.domain, Sub-domain:worker, Active:true
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>>> TID: [0] [ELB] [2014-10-31 06:27:33,520] WARN
>>> {org.apache.synapse.transport.passthru.ConnectCallback} - Connection
>>> refused or failed for : /172.31.7.214:9765
>>> {org.apache.synapse.transport.passthru.ConnectCallback}
>>> TID: [0] [ELB] [2014-10-31 06:27:33,523] WARN
>>> {org.apache.synapse.transport.passthru.ConnectCallback} - Connection
>>> refused or failed for : /172.31.7.214:9765
>>> {org.apache.synapse.transport.passthru.ConnectCallback}
>>> TID: [0] [ELB] [2014-10-31 06:27:33,744] INFO
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>>> Failed over to Host:172.31.7.214, Remote Host:null, Port: 4100, HTTP:9765,
>>> HTTPS:9445, Domain: wso2.as.domain, Sub-domain:worker, Active:true
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>>> TID: [0] [ELB] [2014-10-31 06:27:33,745] INFO
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>>> Failed to send message to Member Host:172.31.7.214, Remote Host:null, Port:
>>> 4100, HTTP:9765, HTTPS:9445, Domain: wso2.as.domain, Sub-domain:worker,
>>> Active:true . Error Code: 101503
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>>> TID: [0] [ELB] [2014-10-31 06:27:33,745] WARN
>>> {org.apache.synapse.transport.passthru.ConnectCallback} - Connection
>>> refused or failed for : /172.31.7.214:9765
>>> {org.apache.synapse.transport.passthru.ConnectCallback}
>>> TID: [0] [ELB] [2014-10-31 06:27:33,747] INFO
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>>> Dropping the faulty/unreachable Member with Domain:wso2.as.domain,
>>> Host:172.31.7.214, Port:4100
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>>> TID: [0] [ELB] [2014-10-31 06:27:34,746] INFO
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>>> Failed over to Host:172.31.0.128, Remote Host:null, Port: 4100, HTTP:9763,
>>> HTTPS:9443, Domain: wso2.as.domain, Sub-domain:worker, Active:true
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>>> TID: [0] [ELB] [2014-10-31 06:27:34,748] INFO
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>>> Failed to send message to Member Host:172.31.7.214, Remote Host:null, Port:
>>> 4100, HTTP:9765, HTTPS:9445, Domain: wso2.as.domain, Sub-domain:worker,
>>> Active:true . Error Code: 101503
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>>> TID: [0] [ELB] [2014-10-31 06:27:34,750] INFO
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>>> Dropping the faulty/unreachable Member with Domain:wso2.as.domain,
>>> Host:172.31.7.214, Port:4100
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>>> TID: [0] [ELB] [2014-10-31 06:27:35,749] INFO
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint} -
>>> Failed over to Host:172.31.7.214, Remote Host:null, Port: 4100, HTTP:9765,
>>> HTTPS:9445, Domain: wso2.as.domain, Sub-domain:worker, Active:true
>>> {org.wso2.carbon.lb.endpoint.endpoint.TenantAwareLoadBalanceEndpoint}
>>> TID: [0] [ELB] [2014-10-31 06:27:35,750] WARN
>>> {org.apache.synapse.transport.passthru.ConnectCallback} - Connection
>>> refused or failed for : /172.31.7.214:9765
>>> {org.apache.synapse.transport.passthru.ConnectCallback}
>>> TID: [0] [ELB] [2014-10-31 06:28:30,604] WARN
>>> {org.apache.synapse.transport.passthru.SourceHandler} - Connection time
>>> out after request is read: http-incoming-61
>>> {org.apache.synapse.transport.passthru.SourceHandler}
>>> TID: [0] [ELB] [2014-10-31 06:28:32,606] WARN
>>> {org.apache.synapse.transport.passthru.SourceHandler} - Connection time
>>> out after request is read: http-incoming-65
>>> {org.apache.synapse.transport.passthru.SourceHandler}
>>> TID: [0] [ELB] [2014-10-31 06:28:33,608] WARN
>>> {org.apache.synapse.transport.passthru.SourceHandler} - Connection time
>>> out after request is read: http-incoming-73
>>> {org.apache.synapse.transport.passthru.SourceHandler}
>>> TID: [0] [ELB] [2014-10-31 06:28:33,608] WARN
>>> {org.apache.synapse.transport.passthru.SourceHandler} - Connection time
>>> out after request is read: http-incoming-69
>>> {org.apache.synapse.transport.passthru.SourceHandler}
>>> TID: [0] [ELB] [2014-10-31 06:28:33,608] WARN
>>> {org.apache.synapse.transport.passthru.SourceHandler} - Connection time
>>> out after request is read: http-incoming-64
>>> {org.apache.synapse.transport.passthru.SourceHandler}
>>> TID: [0] [ELB] [2014-10-31 06:28:33,609] WARN
>>> {org.apache.synapse.transport.passthru.SourceHandler} - Connection time
>>> out after request is read: http-incoming-75
>>> {org.apache.synapse.transport.passthru.SourceHandler}
>>> TID: [0] [ELB] [2014-10-31 06:28:34,610] WARN
>>> {org.apache.synapse.transport.passthru.SourceHandler} - Connection time
>>> out after request is read: http-incoming-60
>>> {org.apache.synapse.transport.passthru.SourceHandler}
>>> TID: [0] [ELB] [2014-10-31 06:28:34,611] WARN
>>> {org.apache.synapse.transport.passthru.SourceHandler} - Connection time
>>> out after request is read: http-incoming-62
>>> {org.apache.synapse.transport.passthru.SourceHandler}
>>> TID: [0] [ELB] [2014-10-31 06:28:34,611] WARN
>>> {org.apache.synapse.transport.passthru.SourceHandler} - Connection time
>>> out after request is read: http-incoming-67
>>> {org.apache.synapse.transport.passthru.SourceHandler}
>>>
>>> We need to restart the ELB to recover from this. Any idea what's going
>>> on? Is this a known issue? I can provide the full log if needed.
>>>
>>> Thanks.
>>> /Gayashan
>>>
>>> --
>>> *Gayashan Amarasinghe*
>>> Software Engineer | Platform TG
>>> WSO2, Inc. | http://wso2.com
>>> lean. enterprise. middleware
>>>
>>> Mobile : +94718314517
>>> Blog : gayashan-a.blogspot.com
>>>
>>
>> --
>> *Kishanthan Thangarajah*
>> Senior Software Engineer,
>> Platform Technologies Team,
>> WSO2, Inc.
>> lean.enterprise.middleware
>>
>> Mobile - +94773426635
>> Blog - http://kishanthan.wordpress.com
>> Twitter - http://twitter.com/kishanthan
>>
>

--
*Kishanthan Thangarajah*
_______________________________________________
Dev mailing list
[email protected]
http://wso2.org/cgi-bin/mailman/listinfo/dev
