This is what the output of top looks like on the Riak server while it is
hung:
top - 16:51:08 up 1:06, 2 users, load average: 17.60, 21.76, 21.20
Tasks: 261 total, 7 running, 253 sleeping, 0 stopped, 1 zombie
Cpu(s): 0.0%us, 12.6%sy, 0.0%ni, 62.4%id, 25.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 30623232k total, 30403632k used, 219600k free, 184524k buffers
Swap: 0k total, 0k used, 0k free, 17739960k cached
 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
6128 root      20   0 17476 1432  972 R    0  0.0  0:00.02 top
   1 root      20   0 24340 2308 1344 S    0  0.0  0:01.50 init
   2 root      20   0     0    0    0 S    0  0.0  0:00.01 kthreadd
   3 root      20   0     0    0    0 S    0  0.0  0:00.00 ksoftirqd/0
   4 root      20   0     0    0    0 S    0  0.0  0:01.06 kworker/0:0
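
What stands out in that summary is 25% iowait with 0% user CPU. For anyone who wants to watch for the same pattern from a script, here is a minimal sketch that parses the Cpu(s) line as printed above. The function names and the 20% threshold are my own assumptions, and other versions of top print this line in a different layout.

```python
import re

def parse_cpu_line(line):
    """Parse a 'Cpu(s):' summary line (format shown above) into {tag: percent}."""
    # Each field looks like '25.0%wa': capture the number and the short tag.
    return {tag: float(value)
            for value, tag in re.findall(r"(\d+(?:\.\d+)?)%(\w+)", line)}

def looks_io_bound(cpu, wa_threshold=20.0):
    # Hypothetical rule of thumb: high iowait while user CPU is near zero.
    return cpu.get("wa", 0.0) >= wa_threshold and cpu.get("us", 0.0) < 1.0

sample = ("Cpu(s): 0.0%us, 12.6%sy, 0.0%ni, 62.4%id, 25.0%wa, "
          "0.0%hi, 0.0%si, 0.0%st")
cpu = parse_cpu_line(sample)
print(cpu["wa"], looks_io_bound(cpu))
```
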
On Thu, Mar 27, 2014 at 9:38 AM, Michael Dillon <[email protected]> wrote:
> I was running Riak 2.0pre11 but now see the same problem on pre20.
>
> I've reduced the Riak cluster to one single node, to eliminate the
> inter-node communication from the issue. From another server I run a script
> to do 100,000 inserts using the Python client (presumably 1.4.something).
> Each insert is in a loop with 3 retries and it always specifies a server
> timeout value. For this test, the HTML docs are small enough that the
> 60,000 millisec default timeout value is always specified. Currently there
> is no socket timeout specified on the client side.
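
To make the shape of that loop concrete, here is a minimal sketch of the retry logic. It is not the actual script: `store` is a hypothetical callable standing in for the real client call, and `insert_with_retries` is a name made up for illustration. The point it shows is that a retry only happens when the call raises, so a call that never returns is never retried.

```python
import socket

def insert_with_retries(store, key, doc, retries=3, timeout_ms=60000):
    """Call store(key, doc, timeout_ms) up to `retries` times.

    `store` is a hypothetical stand-in for the real client call.
    Returns the number of attempts used; re-raises the last error on failure.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            store(key, doc, timeout_ms)
            return attempt
        except OSError as exc:
            # socket.timeout is a subclass of OSError. We only get here if
            # the call raises; a call that blocks forever never reaches this.
            last_error = exc
    raise last_error

# Simulated store that times out twice and then succeeds.
calls = []

def flaky_store(key, doc, timeout_ms):
    calls.append(key)
    if len(calls) < 3:
        raise socket.timeout("simulated client-side socket timeout")

print(insert_with_retries(flaky_store, "doc-1", "<html></html>"))
```

Without a socket timeout on the client side, nothing ever raises when the server goes silent, so the loop sits inside the first attempt indefinitely.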
>
> Part way through, one of the inserts hung. On investigation the Riak
> server seemed in an OK state. I ran strace ps ax and it did not hang.
> strace riak-admin status also was OK. top showed one of the riak processes
> and strace -p PID showed that it was waiting in select. But then, after a
> retry, the client continued to do inserts. That particular strace showed no
> change, so I am not sure whether the process was important. I ran top again
> and the same process showed 80% CPU utilization.
>
> Then we got a full hang of Riak. The client did not retry because the
> server did not time out. It just hung and hung, for over 15 minutes as I
> write this. When I ran strace ps ax on the Riak server, it hung reading
> /proc/PID/cmdline where PID was the same as the one mentioned above. When I
> run pstree -p (which never hangs) it shows this:
>
> |-run_erl(4171)---beam.smp(4173)-+-cpu_sup(4473)
>
> 4173 is the PID that I have been talking about. Oddly enough, when I
> opened a new SSH connection to this server, the strace ps ax, which had been
> hung on opening a /proc file, suddenly ran to completion. However, running
> it again hung on the same line. Here are a few lines of strace ps ax:
>
> stat("/proc/4173", {st_mode=S_IFDIR|0555, st_size=0, ...}) = 0
> open("/proc/4173/stat", O_RDONLY) = 6
> read(6, "4173 (beam.smp) S 4171 4173 4173"..., 1024) = 277
> read(6, "", 747) = 0
> close(6) = 0
> open("/proc/4173/status", O_RDONLY) = 6
> read(6, "Name:\tbeam.smp\nState:\tS (sleepin"..., 1024) = 787
> read(6, "", 237) = 0
> close(6) = 0
> open("/proc/4173/cmdline", O_RDONLY) = 6
> read(6,
>
> Any idea what is happening?
>
> When Riak is running normally, is there a way to identify a PID which
> would be useful to attach to strace if I see this problem developing? Or
> some other way to look at status of all the different beam.smp processes
> and identify where the problem is located?
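
One approach I can think of, sketched below, is to list the per-thread states under /proc/PID/task and look for threads in state D (uninterruptible sleep). This assumes the beam.smp entries are threads of the one beam.smp process, and the function names are made up for illustration. It reads only the status files, since the strace above shows /proc/4173/status was still readable while cmdline hung.

```python
import glob
import os

def parse_state(status_text):
    """Return the one-letter state code from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            # e.g. 'State:\tS (sleeping)' -> 'S'
            return line.split()[1]
    return None

def thread_states(pid):
    """Map each thread ID of `pid` to its state code (Linux /proc only)."""
    states = {}
    for path in glob.glob("/proc/%d/task/*/status" % pid):
        tid = int(path.split("/")[4])
        try:
            with open(path) as f:
                states[tid] = parse_state(f.read())
        except OSError:
            # The thread exited between the glob and the open.
            continue
    return states

print(parse_state("Name:\tbeam.smp\nState:\tS (sleeping)\n"))
# Demo on this script's own PID; substitute the beam.smp PID on a real server.
print(thread_states(os.getpid()))
```

A thread ID that shows D would then be a candidate for strace -p.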
>
> Doesn't this indicate a problem with the way that Riak implements the
> server timeout? Shouldn't some supervisor be killing and restarting a child
> process or subtree when this occurs?
> --
> Michael Dillon - Senior Software Engineer
> PageFreezer.com
> #200 - 311 Water Street
> Vancouver, BC V6B 1B8
>