Thanks a lot Manuel for your findings and information.
It's good to know btrfs is not causing this issue and that the common factor
is an MD journal placed on another RAID device.
I have moved the journal from a logical volume on RAID1 to a plain partition
on an SSD and I will monitor the state.
Vojtech
On 17. 03. 21 5:35, Manuel Riel wrote:
Final update on this issue for anyone who encounters a similar problem in the
future:
I didn't observe any "hanging" RAID devices after switching to an ordinary NVMe
partition as the journal. So using another md array (e.g. an md-RAID1 array) as
the journal doesn't seem to be supported.
The docs[1] say "This means the cache disk must be ... sustainable." The
"sustainable" part motivated me to use an md-RAID1 array. I think the docs should
mention that the journal can't be on another RAID array.
I'm sending in a patch to emphasize this in the docs.
1: https://www.kernel.org/doc/html/latest/driver-api/md/raid5-cache.html
On Feb 28, 2021, at 4:34 PM, Manuel Riel <m...@snapdragon.cc> wrote:
Hit another mdadm "hanger" today. Reads were no longer possible and md4_raid6
was stuck at 100% CPU.
I've now moved the write journal off the RAID1 device. So it's not a "nested"
RAID any more. Hope this will help.
With only one hardware device used as write cache, I suppose only write-through
mode[1] is suggested now.
1: https://www.kernel.org/doc/Documentation/md/raid5-cache.txt
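
For anyone who wants to confirm which journal mode an array is actually using, here is a minimal Python sketch. It assumes the array exposes /sys/block/<array>/md/journal_mode and that a read returns both modes with the active one in square brackets (e.g. "[write-through] write-back"). That output format is an assumption, not something confirmed in this thread. The function names are hypothetical, and md4 is just the array name from the report above.

```python
import re
from pathlib import Path


def active_journal_mode(sysfs_text: str) -> str:
    """Return the mode shown in square brackets in a journal_mode read.

    Assumes the kernel prints both modes and brackets the active one,
    e.g. "[write-through] write-back".
    """
    match = re.search(r"\[([a-z-]+)\]", sysfs_text)
    if match is None:
        raise ValueError(f"no bracketed mode found in {sysfs_text!r}")
    return match.group(1)


def read_journal_mode(array: str = "md4", sysfs_root: str = "/sys/block") -> str:
    """Read and parse journal_mode for one md array (hypothetical helper)."""
    path = Path(sysfs_root) / array / "md" / "journal_mode"
    return active_journal_mode(path.read_text())
```

Calling read_journal_mode("md4") on the affected host should then return either "write-through" or "write-back", provided the sysfs file exists and is readable.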