Hi,

Our servers run daemons that run many real-time threads. These threads
serve client nodes by performing I/O on a set of disks that are
configured as DRBD pairs with disks on peer servers, for high
availability of data. Btrfs is the filesystem configured on top of
DRBD.

While testing high availability under fairly high load, we have seen
the following behaviour a couple of times. When the server that was
killed comes back up and DRBD starts resyncing data between the disks,
some performance hit is expected on the peer node that has taken over
the service. However, the real-time threads (mentioned above) on the
active node hog the CPUs. To debug this, we tried to force a core dump
of these threads by sending SIGABRT, but they did not respond to any
signals. Only after enabling real-time throttling (to limit real-time
CPU usage to 50%) and waiting a few minutes were we able to force a
core dump. Even then, the core file did not contain much useful
information (I think it was a partial or corrupted dump).
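
For anyone who wants to reproduce the throttling step, here is a
minimal sketch. It assumes the throttling is applied through the
standard sysctls sched_rt_period_us and sched_rt_runtime_us; the
message above does not say which mechanism was actually used, and the
helper names are hypothetical.

```python
from pathlib import Path

# Standard scheduler sysctls. Using them here is an assumption: the
# original message does not say how the throttling was applied.
RT_PERIOD = Path("/proc/sys/kernel/sched_rt_period_us")
RT_RUNTIME = Path("/proc/sys/kernel/sched_rt_runtime_us")


def rt_runtime_for_fraction(period_us: int, fraction: float) -> int:
    """Return the runtime budget (us) that caps RT tasks at `fraction` of each period."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return round(period_us * fraction)


def throttle_rt(fraction: float) -> int:
    """Write the new budget to the sysctl (needs root) and return the value written."""
    period_us = int(RT_PERIOD.read_text())
    runtime_us = rt_runtime_for_fraction(period_us, fraction)
    RT_RUNTIME.write_text(str(runtime_us))
    return runtime_us
```

With the usual period of 1000000 us, throttle_rt(0.5) would write
500000, leaving half of each period to non-real-time tasks.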

Since the signals were not being picked up, it looks to me like all
these threads were stuck inside a system call. And since the majority
of the system calls made by these threads are VFS calls on btrfs, I
suspect they were stuck in some I/O path; given the CPU usage,
probably spinning on a spinlock (I'm open to suggestions of other
possibilities). This is the reason I'm posting on this mailing list.
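
One way to check the "stuck in a system call" theory without a core
dump is to compare per-thread user and kernel CPU time from
/proc/<pid>/task/<tid>/stat. The sketch below is only an illustration
of that idea, not something that was run on the affected servers; the
function names are hypothetical.

```python
from pathlib import Path


def parse_task_stat(line: str) -> dict:
    """Extract state, utime and stime from one /proc/<pid>/task/<tid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the last ')' instead of on whitespace.
    head, _, tail = line.rpartition(")")
    tid, _, comm = head.partition(" (")
    fields = tail.split()
    return {
        "tid": int(tid),
        "comm": comm,
        "state": fields[0],         # field 3: R, S, D, ...
        "utime": int(fields[11]),   # field 14: user-mode clock ticks
        "stime": int(fields[12]),   # field 15: kernel-mode clock ticks
    }


def snapshot(pid: int) -> list:
    """Return parsed stat entries for every thread of `pid`."""
    tasks = Path(f"/proc/{pid}/task")
    return [parse_task_stat((t / "stat").read_text()) for t in sorted(tasks.iterdir())]
```

Taking two snapshots a few seconds apart and comparing them should
show whether a thread in state R is accumulating stime while utime
stays flat, which would suggest it is burning CPU inside the kernel.
Where the kernel exposes it, /proc/<pid>/task/<tid>/stack (readable by
root) can then show which kernel call chain the thread is in.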

Is there a known bug that might have caused this? The kernel version
we're using is 4.4.0. If we upgrade the kernel, how likely is it that
this behaviour goes away?

Or is my analysis of the problem entirely wrong? My feeling is that
this may be an issue with Btrfs not getting a response from DRBD
quickly enough, because we have been using ext4 on top of DRBD for a
long time and have never seen such issues during HA tests there.

-- 
-Shyam