On Mon, Oct 17, 2016 at 06:44:14PM +0200, Stefan Malte Schumacher wrote:
> Hello
> I would like to monitor my btrfs-filesystem for missing drives. On
> Debian mdadm uses a script in /etc/cron.daily, which calls mdadm and
> sends an email if anything is wrong with the array. I would like to do
> the same with btrfs. In my first attempt I grepped and cut the
> information from "btrfs fi show" and let the script send an email if
> the number of devices was not equal to the preselected number.
> Then I saw this:
> ubuntu@ubuntu:~$ sudo btrfs filesystem show
> Label: none  uuid: 67b4821f-16e0-436d-b521-e4ab2c7d3ab7
>     Total devices 6 FS bytes used 5.47TiB
>     devid    1 size 1.81TiB used 1.71TiB path /dev/sda3
>     devid    2 size 1.81TiB used 1.71TiB path /dev/sdb3
>     devid    3 size 1.82TiB used 1.72TiB path /dev/sdc1
>     devid    4 size 1.82TiB used 1.72TiB path /dev/sdd1
>     devid    5 size 2.73TiB used 2.62TiB path /dev/sde1
>     *** Some devices missing
> on this page: 
> https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices
> The number of devices is still at 6, despite the fact that one of the
> drives is missing, which means that my first idea doesn't work.

Using fi show for this isn't a good idea.  By the time btrfs fi show
tells you something is different from the norm, you've probably already
crashed at least once and are now mounting with the 'degraded' option.

> I have
> two questions:
> 1) Has anybody already written a script like this? After all, there is
> no need to reinvent the wheel a second time.
> 2) What should I best grep for? In this case I would just go for the
> "missing". Does this cover all possible outputs of btrfs fi show in
> case of a damaged array? What other outputs do I need to consider for
> my script.

I monitor the device error counters, i.e. the output of

        for fs in /fs1 /fs2 /fs3... ; do
                btrfs dev stat "$fs" | grep -v " 0$"
        done

and send an email when it isn't empty.
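
For a cron job, the check above could be wrapped roughly like this.
This is only a sketch, not a tested production script: the function
names, the "root" recipient and the subject line are made up for
illustration, and it assumes a working mail(1) command on the host.
Mount points are passed as arguments.

```shell
#!/bin/sh
# Sketch of a cron-style check. Function names, the recipient and the
# subject line are placeholders, not part of any existing tool.

# Keep only counter lines whose value is not zero (same filter as above).
nonzero_counters() {
        grep -v ' 0$'
}

# Print the non-zero error counters of every mount point given.
check_filesystems() {
        for fs in "$@"; do
                btrfs dev stat "$fs" | nonzero_counters
        done
}

# Only run the check when mount points were passed on the command line.
if [ "$#" -gt 0 ]; then
        report=$(check_filesystems "$@")
        # Send mail only if at least one counter is non-zero.
        if [ -n "$report" ]; then
                printf '%s\n' "$report" |
                        mail -s "btrfs device errors on $(hostname)" root
        fi
fi
```

Called as e.g. "script /fs1 /fs2" from cron.daily, it stays silent
as long as all counters are zero.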

When there are errors I investigate in more detail (is it a failing disk?
failed disk?  bad cables?  bad RAM?  One-off UNC sector that can be
ignored?), fix any problems (e.g. replace hardware, run scrub), and
reset the counters to zero with 'btrfs dev stat -z'.

> Yours sincerely
> Stefan
