Hi Michael,

This is a long shot, but if you can get it back into that state again,
please attach to the console (run "riak attach-direct" if you started the
node with "riak start"; otherwise use the console you already have), then
type "ok." and hit enter a few times and see if that unclogs it.

You should see something like this (IP addresses changed to protect the
innocent):

jons-retina:dev1((riak_ee-2.0.0pre8)) jmeredith$ bin/riak attach-direct
Direct Shell: Use "Ctrl-D" to quit. "Ctrl-C" will terminate the riak node.
Attaching to
/tmp//Users/jmeredith/basho/work/riak-ee-2.0pre8/dev/dev1/bin/../erlang.pipe.1
(^D to exit)

([email protected])1>
([email protected])1>
([email protected])1>
([email protected])1>
([email protected])1> ok.
ok
([email protected])2>
([email protected])2>


I'll be very interested to hear whether it unlocks the moment it prints
the ok back to the console.

Be careful to use ^D to leave if you attached directly; ^C will kill your
server.
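
Separately, since the client currently has no socket timeout, a hung node
will block it indefinitely. As a rough sketch only, each attempt could be
run under a client-side deadline. The `insert` callable here is a
hypothetical stand-in for whatever store call your script makes, not a Riak
client API, and the 65 second default is just an assumed value slightly
above your 60,000 ms server timeout:

```python
import concurrent.futures
import time


def insert_with_deadline(insert, doc, retries=3, deadline_s=65.0):
    """Call insert(doc) up to `retries` times, giving up on any single
    attempt that has not returned within `deadline_s` seconds.

    `insert` is a hypothetical callable wrapping the real store call.
    """
    for attempt in range(1, retries + 1):
        # One worker thread per attempt, so a hung attempt cannot block
        # the next one.
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        future = pool.submit(insert, doc)
        try:
            return future.result(timeout=deadline_s)
        except concurrent.futures.TimeoutError:
            print("attempt %d: no reply within %.1fs" % (attempt, deadline_s))
        except Exception as exc:
            print("attempt %d failed: %r" % (attempt, exc))
        finally:
            # Do not wait for a hung worker. It cannot be killed and keeps
            # running in the background until the underlying call returns.
            pool.shutdown(wait=False)
        time.sleep(0.1 * attempt)  # brief backoff before the next try
    raise RuntimeError("insert failed after %d attempts" % retries)
```

This only stops the script from waiting forever; an abandoned attempt still
holds its connection until the call underneath it returns, so it is a way to
keep the test moving and log when the hang starts, not a fix.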

Jon




On Thu, Mar 27, 2014 at 10:38 AM, Michael Dillon <[email protected]
> wrote:

> I was running Riak 2.0pre11 but now see the same problem on pre20
>
> I've reduced the Riak cluster to one single node, to eliminate the
> inter-node communication from the issue. From another server I run a script
> to do 100,000 inserts using the Python client (presumably 1.4.something).
> Each insert is in a loop with 3 retries and it always specifies a server
> timeout value. For this test, the HTML docs are small enough that the
> 60,000 millisec default timeout value is always specified. Currently there
> is no socket timeout specified on the client side.
>
> Part way through, one of the inserts hung. On investigation the Riak
> server seemed in an OK state. I ran strace ps ax and it did not hang.
> strace riak-admin status also was OK. top showed one of the riak processes
> and strace -p PID showed that it was waiting in select. But then, after a
> retry, the client continued to do inserts. That particular strace showed no
> change so not sure whether the process was important. Ran top again and the
> same process showed 80% CPU utilization.
>
> Then we got a full hang of Riak. The client did not retry because the
> server did not time out. It just hung and hung. Over 15 minutes as I write
> this. When I ran strace ps ax on the Riak server, it hung reading
> /proc/PID/cmdline where PID was the same as the one mentioned above. When I
> run pstree -p (which never hangs) it shows this
>
> |-run_erl(4171)---beam.smp(4173)-+-cpu_sup(4473)
>
> 4173 is the PID that I have been talking about. Oddly enough, when I
> opened a new SSH connection to this server, the strace ps ax which had been
> hung on opening a /proc file, suddenly ran to completion. However, running
> it again, hung again on the same line. Here are a few lines of strace ps ax
>
> stat("/proc/4173", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
> open("/proc/4173/stat", O_RDONLY)       = 6
> read(6, "4173 (beam.smp) S 4171 4173 4173"..., 1024) = 277
> read(6, "", 747)                        = 0
> close(6)                                = 0
> open("/proc/4173/status", O_RDONLY)     = 6
> read(6, "Name:\tbeam.smp\nState:\tS (sleepin"..., 1024) = 787
> read(6, "", 237)                        = 0
> close(6)                                = 0
> open("/proc/4173/cmdline", O_RDONLY)    = 6
> read(6,
>
> Any idea what is happening?
>
> When Riak is running normally, is there a way to identify a PID which
> would be useful to attach to strace if I see this problem developing? Or
> some other way to look at status of all the different beam.smp processes
> and identify where the problem is located?
>
> Doesn't this indicate a problem with the way that Riak implements the
> server timeout? Shouldn't some supervisor be killing and restarting a child
> process or subtree when this occurs?
> --
> Michael Dillon - Senior Software Engineer
> PageFreezer.com
> #200 - 311 Water Street
> Vancouver,  BC  V6B 1B8
>
> _______________________________________________
> riak-users mailing list
> [email protected]
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
>


-- 
Jon Meredith
VP, Engineering
Basho Technologies, Inc.
[email protected]