Re: Weird hanging problem

Michael Dillon Thu, 27 Mar 2014 09:54:07 -0700

This is what the output of top looks like on the Riak server while it is
hung:


top - 16:51:08 up  1:06,  2 users,  load average: 17.60, 21.76, 21.20
Tasks: 261 total,   7 running, 253 sleeping,   0 stopped,   1 zombie
Cpu(s):  0.0%us, 12.6%sy,  0.0%ni, 62.4%id, 25.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  30623232k total, 30403632k used,   219600k free,   184524k buffers
Swap:        0k total,        0k used,        0k free, 17739960k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 6128 root      20   0 17476 1432  972 R    0  0.0   0:00.02 top
    1 root      20   0 24340 2308 1344 S    0  0.0   0:01.50 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.01 kthreadd
    3 root      20   0     0    0    0 S    0  0.0   0:00.00 ksoftirqd/0
    4 root      20   0     0    0    0 S    0  0.0   0:01.06 kworker/0:0
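
With 25% iowait and a load average near 20, it may help to see which
threads of beam.smp are blocked in the kernel. Below is a minimal sketch
that lists the state of every thread of a process by reading
/proc/PID/task/TID/stat. It relies on the stat files staying readable
during the hang, as /proc/4173/stat did in the quoted strace. A state of
"D" means uninterruptible sleep, usually waiting on I/O. The helper names
parse_stat_line and thread_states are made up for illustration, and
whether a thread in state D is actually the cause of the hang is only an
assumption.

```python
import os


def parse_stat_line(line):
    """Split a /proc/<pid>/stat line into (pid, command, state)."""
    # The command is wrapped in parentheses and may itself contain spaces
    # or parentheses, so split on the first "(" and the last ")".
    head, _, rest = line.partition("(")
    comm, _, tail = rest.rpartition(")")
    return int(head), comm, tail.split()[0]


def thread_states(pid):
    """Yield (tid, command, state) for each thread of pid (Linux only)."""
    task_dir = "/proc/%d/task" % pid
    for tid in sorted(os.listdir(task_dir), key=int):
        try:
            with open(os.path.join(task_dir, tid, "stat")) as f:
                yield parse_stat_line(f.read())
        except OSError:
            # The thread exited between listdir() and open(); skip it.
            continue


def blocked_threads(pid):
    """Return the thread ids of pid that are in uninterruptible sleep."""
    return [tid for tid, _comm, state in thread_states(pid) if state == "D"]
```

Calling blocked_threads(4173) on the Riak server would return the thread
ids that could then be passed to strace -p one at a time.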


On Thu, Mar 27, 2014 at 9:38 AM, Michael Dillon
<[email protected]> wrote:

> I was running Riak 2.0pre11 but now see the same problem on pre20.
>
> I've reduced the Riak cluster to one single node, to eliminate the
> inter-node communication from the issue. From another server I run a script
> to do 100,000 inserts using the Python client (presumably 1.4.something).
> Each insert is in a loop with 3 retries and it always specifies a server
> timeout value. For this test, the HTML docs are small enough that the
> 60,000 millisec default timeout value is always specified. Currently there
> is no socket timeout specified on the client side.
>
> Part way through, one of the inserts hung. On investigation the Riak
> server seemed to be in an OK state. I ran strace ps ax and it did not hang.
> strace riak-admin status was also OK. top showed one of the riak processes,
> and strace -p PID showed that it was waiting in select. But then, after a
> retry, the client continued to do inserts. That particular strace showed no
> change, so I am not sure whether the process was important. I ran top again
> and the same process showed 80% CPU utilization.
>
> Then we got a full hang of Riak. The client did not retry because the
> server did not time out. It just hung and hung, for over 15 minutes as I
> write this. When I ran strace ps ax on the Riak server, it hung reading
> /proc/PID/cmdline, where PID was the same as the one mentioned above. When I
> run pstree -p (which never hangs) it shows this:
>
> |-run_erl(4171)---beam.smp(4173)-+-cpu_sup(4473)
>
> 4173 is the PID that I have been talking about. Oddly enough, when I
> opened a new SSH connection to this server, the strace ps ax that had been
> hung reading a /proc file suddenly ran to completion. However, running
> it again hung again on the same line. Here are a few lines of strace ps ax:
>
> stat("/proc/4173", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
> open("/proc/4173/stat", O_RDONLY)       = 6
> read(6, "4173 (beam.smp) S 4171 4173 4173"..., 1024) = 277
> read(6, "", 747)                        = 0
> close(6)                                = 0
> open("/proc/4173/status", O_RDONLY)     = 6
> read(6, "Name:\tbeam.smp\nState:\tS (sleepin"..., 1024) = 787
> read(6, "", 237)                        = 0
> close(6)                                = 0
> open("/proc/4173/cmdline", O_RDONLY)    = 6
> read(6,
>
> Any idea what is happening?
>
> When Riak is running normally, is there a way to identify a PID which
> would be useful to attach to strace if I see this problem developing? Or
> some other way to look at status of all the different beam.smp processes
> and identify where the problem is located?
>
> Doesn't this indicate a problem with the way that Riak implements the
> server timeout? Shouldn't some supervisor be killing and restarting a child
> process or subtree when this occurs?
> --
> Michael Dillon - Senior Software Engineer
> PageFreezer.com
> #200 - 311 Water Street
> Vancouver,  BC  V6B 1B8
>
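
Since the quoted test has no socket timeout on the client side, a hung
server blocks the insert loop forever and the retries never run. Here is a
rough sketch of a retry loop that could recover from that, assuming the
client library uses plain Python sockets that honor the process-wide
default timeout (not verified for the Riak Python client). The store
callable and the 70-second value are placeholders, not the real client
API. The timeout is chosen to sit a little above the 60,000 ms server
timeout.

```python
import socket
import time

# Assumption: slightly longer than the 60 s server-side timeout, so the
# server gets the first chance to report a timeout itself.
CLIENT_TIMEOUT_S = 70.0


def configure_client_timeout(seconds=CLIENT_TIMEOUT_S):
    """Apply a default timeout to sockets created after this call."""
    socket.setdefaulttimeout(seconds)


def insert_with_retries(store, key, doc, retries=3, pause=1.0):
    """Call store(key, doc), retrying on socket-level failures.

    store is a hypothetical callable wrapping the real client insert.
    """
    last_error = None
    for attempt in range(retries):
        try:
            return store(key, doc)
        except OSError as exc:  # socket.timeout is a subclass of OSError
            last_error = exc
            if attempt + 1 < retries:
                time.sleep(pause)
    raise last_error
```

With this in place a blocked read raises socket.timeout after 70 seconds
instead of waiting indefinitely, and the loop moves on to the next attempt.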



-- 
Michael Dillon - Senior Software Engineer
PageFreezer.com
#200 - 311 Water Street
Vancouver,  BC  V6B 1B8

