Hi Michael,

This is a long shot, but if you can get it back into that state again, please do a "riak attach-direct" to the console if you started with "riak start"; otherwise, from the console, type "ok." and hit enter a few times and see if that unclogs it.
You should see something like this (IP addresses changed to protect the innocent):

jons-retina:dev1((riak_ee-2.0.0pre8)) jmeredith$ bin/riak attach-direct
Direct Shell: Use "Ctrl-D" to quit. "Ctrl-C" will terminate the riak node.
Attaching to /tmp//Users/jmeredith/basho/work/riak-ee-2.0pre8/dev/dev1/bin/../erlang.pipe.1 (^D to exit)

([email protected])1>
([email protected])1>
([email protected])1>
([email protected])1>
([email protected])1> ok.
ok
([email protected])2>
([email protected])2>

I'll be very interested if it unlocks the moment it prints the ok back to the console.

Beware: use ^D to leave if you direct attached; ^C will kill your server.

Jon

On Thu, Mar 27, 2014 at 10:38 AM, Michael Dillon <[email protected]> wrote:

> I was running Riak 2.0pre11 but now see the same problem on pre20
>
> I've reduced the Riak cluster to one single node, to eliminate the
> inter-node communication from the issue. From another server I run a script
> to do 100,000 inserts using the Python client (presumably 1.4.something).
> Each insert is in a loop with 3 retries and it always specifies a server
> timeout value. For this test, the HTML docs are small enough that the
> 60,000 millisec default timeout value is always specified. Currently there
> is no socket timeout specified on the client side.
>
> Part way through, one of the inserts hung. On investigation the Riak
> server seemed in an OK state. I ran strace ps ax and it did not hang.
> strace riak-admin status also was OK. top showed one of the riak processes
> and strace -p PID showed that it was waiting in select. But then, after a
> retry, the client continued to do inserts. That particular strace showed no
> change so not sure whether the process was important. Ran top again and the
> same process showed 80% CPU utilization.
>
> Then we got a full hang of Riak. The client did not retry because the
> server did not time out. It just hung and hung. Over 15 minutes as I write
> this.
> When I ran strace ps ax on the Riak server, it hung reading
> /proc/PID/cmdline where PID was the same as the one mentioned above. When I
> run pstree -p (which never hangs) it shows this
>
> |-run_erl(4171)---beam.smp(4173)-+-cpu_sup(4473)
>
> 4173 is the PID that I have been talking about. Oddly enough, when I
> opened a new SSH connection to this server, the strace ps ax which had been
> hung on opening a /proc file, suddenly ran to completion. However, running
> it again, hung again on the same line. Here are a few lines of strace ps ax
>
> stat("/proc/4173", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
> open("/proc/4173/stat", O_RDONLY) = 6
> read(6, "4173 (beam.smp) S 4171 4173 4173"..., 1024) = 277
> read(6, "", 747) = 0
> close(6) = 0
> open("/proc/4173/status", O_RDONLY) = 6
> read(6, "Name:\tbeam.smp\nState:\tS (sleepin"..., 1024) = 787
> read(6, "", 237) = 0
> close(6) = 0
> open("/proc/4173/cmdline", O_RDONLY) = 6
> read(6,
>
> Any idea what is happening?
>
> When Riak is running normally, is there a way to identify a PID which
> would be useful to attach to strace if I see this problem developing? Or
> some other way to look at status of all the different beam.smp processes
> and identify where the problem is located?
>
> Doesn't this indicate a problem with the way that Riak implements the
> server timeout? Shouldn't some supervisor be killing and restarting a child
> process or subtree when this occurs?
> --
> Michael Dillon - Senior Software Engineer
> PageFreezer.com
> #200 - 311 Water Street
> Vancouver, BC V6B 1B8
>
> _______________________________________________
> riak-users mailing list
> [email protected]
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
>

--
Jon Meredith
VP, Engineering
Basho Technologies, Inc.
[email protected]
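
The quoted report says each insert runs in a loop with 3 retries but that no socket timeout is set on the client side, so a hung server blocks the client indefinitely and the retry never fires. A minimal sketch of a retry loop that depends on a client-side timeout follows. The names `insert_with_retries` and `do_insert` are hypothetical, and whether the Riak Python client honors `socket.setdefaulttimeout` for its connections is an assumption to verify, not something confirmed in this thread.

```python
import socket
import time


def insert_with_retries(do_insert, retries=3, backoff_s=0.0):
    """Call do_insert() up to `retries` times, retrying only on timeouts.

    `do_insert` is a hypothetical zero-argument callable that performs one
    insert and raises socket.timeout when the client-side timeout expires.
    """
    last_error = None
    for _attempt in range(retries):
        try:
            return do_insert()
        except (socket.timeout, TimeoutError) as exc:
            last_error = exc
            time.sleep(backoff_s)  # optional pause before the next attempt
    # Every attempt timed out: surface the last timeout to the caller.
    raise last_error


# Assumption: sockets created after this call, and that do not set their own
# timeout, raise socket.timeout instead of blocking forever. 65 s is chosen
# only to sit slightly above the 60,000 ms server timeout mentioned above.
# socket.setdefaulttimeout(65.0)
```

With a client-side timeout in place, a full server hang would show up as three timed-out attempts and an exception rather than a client that waits forever.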
