Jon, thanks so much. Yes, beam.smp was maxing CPU and memory on the one node. I managed to get it to exit and now it's a 3-node cluster. I'll take your advice on changing the handoff.
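For reference, a minimal sketch of what that app.config change might look like (only the relevant riak_core entry is shown; the surrounding settings are elided):

    {riak_core, [
        %% ... existing riak_core settings ...
        %% Allow up to 4 concurrent handoffs, per Jon's suggestion below.
        %% Takes effect on the next restart.
        {handoff_concurrency, 4}
    ]},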
On Tue, Mar 20, 2012 at 3:57 PM, Jon Meredith <[email protected]> wrote:
> Hi Michael,
>
> When you say 'Only this node is failing', do you mean the hardware is failing (drives etc.) or a problem with Riak itself? If it's a Riak problem, sharing the error messages from the logs would be helpful.
>
> The fix will be included in the next point release, but we haven't set a date for that yet. The files for packaging Riak are included with the distribution, but you'll need to get Erlang set up and built as well to be able to build it (make package; you'll have to set a few variables our build system uses). I'd recommend holding off until we do an official version.
>
> If you'd like to increase the amount of handoff, you can add {handoff_concurrency, 4} to the riak_core section of app.config, which will take effect on the next restart, or you can attach to the riak console (riak attach) and run
>
> rpc:multicall(riak_core_handoff_manager, set_concurrency, [4], 5000).
>
> (the dot is important), then ^D to disconnect.
>
> The handoff concurrency value was reduced from 4 to 1 for the 1.0.3 release over concerns that users building larger clusters would overwhelm new nodes when they were added, since the concurrency value applied only to outbound handoff. For 1.1 we've changed things so that the concurrency value applies to both inbound and outbound handoff, so it is safer to set it higher.
>
> As part of the pull request above we've also changed the logging slightly to avoid printing out handoff starting messages until handoff succeeds. In 1.1.0/1.1.1, when handoff concurrency is exceeded you may see repeats of 'Starting handoff' messages if the destination node denies the transfer due to hitting the limit.
>
> Cheers, Jon.
>
> On Tue, Mar 20, 2012 at 4:28 PM, Michael Clemmons <[email protected]> wrote:
>
>> [apologies for the delay on this email, sent to Armon only first]
>>
>> I'm having similar issues on a testing cluster for 1.1.1rc1. I have 1 out of 4 nodes failing multiple times and not restarting well, and there are around 100 pending transfers. Only this node is failing. I've stopped pointing traffic at the nodes and have attempted to remove this machine from the cluster.
>> It's slowly leaving, but is moving very slowly for not much data. The metadata is important, and losing any would be a significant time sink (but obviously not vital, since we used a day-old build).
>> What's the likelihood that pull request will make it into a deb build in the near future, or will the makefile generate a deb?
>> -Michael
>>
>> On Mon, Mar 19, 2012 at 11:40 AM, Armon Dadgar <[email protected]> wrote:
>>
>>> Okay, good to know this is a known issue. I attached the logs for the last time this occurred in my original email.
>>>
>>> I'll try to capture this information if the problem occurs again. Thanks.
>>>
>>> Best Regards,
>>>
>>> Armon Dadgar
>>>
>>> On Mar 19, 2012, at 11:36 AM, Jon Meredith wrote:
>>>
>>> Hi Armon,
>>>
>>> We've recently patched an issue that affects handoffs here: https://github.com/basho/riak_core/pull/153
>>>
>>> If the issue repeats for you, as well as the logs it would be very useful if you could follow the instructions from the pull request above and run the 'riak_core_handoff_manager:status().' command against all nodes.
>>>
>>> The pull request works around an issue where it looks like the kernel has closed a socket (no evidence of it any longer with netstat/ss) but the Erlang process is still stuck in a receive call from it (gen_tcp:recv/2 to be more precise).
>>>
>>> Please let us know if you hit it again.
>>>
>>> Best, Jon.
>>>
>>> On Mon, Mar 19, 2012 at 12:10 PM, Armon Dadgar <[email protected]> wrote:
>>>
>>>> I wanted to ping the mailing list and see if anybody else has encountered stalls in the partition handoffs on Riak 1.1. We added a new node to our cluster last Friday, but noticed that the partition handoffs appear to have stopped after about 7-8 hours.
>>>>
>>>> Most of the handoffs completed, and the only handoffs that remained were from node 3 to node 2. The ring claimant (node 1) indicated that node 3 was unreachable (via ring_status). However, Riak Control did not indicate that node 3 was unreachable, and in fact it was actually live and continuing to serve requests.
>>>>
>>>> To resolve this, I tried to just restart node 3. I ran "riak stop" multiple times, but this did not actually seem to do anything (the node kept running and serving requests). Next, I attached to the node and ran "init:stop()." This started to shut down various sub-systems, but the node was still running. Sending a SIGTERM signal to the beam VM finally killed it. Restarting the node with "riak start" worked as expected, and the node promptly resumed the handoffs and finished in a few hours.
>>>>
>>>> I'm not sure exactly what the issue was, but something seemed to cause a stalling of the handoffs.
>>>>
>>>> I've attached the contents of our console.log, erlang.log, error.log and crash.log from the relevant times if that is useful.
>>>>
>>>> Best Regards,
>>>>
>>>> Armon Dadgar
>>>>
>>>
>>> --
>>> Jon Meredith
>>> Platform Engineering Manager
>>> Basho Technologies, Inc.
>>> [email protected]
>>>
>>
>> --
>> -Michael
>>
>
> --
> Jon Meredith
> Platform Engineering Manager
> Basho Technologies, Inc.
> [email protected]
>

--
-Michael
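For anyone hitting the same stall, Jon's suggestions above can be run from an attached console roughly like this (node name and prompts are illustrative):

    $ riak attach
    (riak@node1)1> riak_core_handoff_manager:status().
    (riak@node1)2> rpc:multicall(riak_core_handoff_manager, set_concurrency, [4], 5000).
    %% Detach with ^D; don't run q()., which would stop the node.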
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
