I am seeing a lot of spurious I/O errors that look like they come from
the cache-facing side of btrfs.  While running a heavy load with some
extent-sharing (e.g. building 20 Linux kernels at once from source trees
copied with 'cp -a --reflink=always'), some files will return spurious
EIO on read.  It happens often enough that a Linux kernel build fails
roughly one time in three.
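
For context, the load is roughly the following (just a sketch; the paths,
parallelism, and tree count here are illustrative, not the exact setup):

    # make 20 reflinked copies of an already-configured kernel tree and
    # build them all in parallel; /data/linux-src is a placeholder path
    for i in $(seq 1 20); do
        cp -a --reflink=always /data/linux-src /data/build-$i
    done
    for i in $(seq 1 20); do
        ( cd /data/build-$i && make -j4 ) &
    done
    wait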

I believe the I/O errors to be spurious because:

        - there is no kernel message of any kind during the event

        - scrub detects 0 errors

        - device stats report 0 errors

        - the drive firmware reports nothing wrong through SMART

        - there seems to be no attempt to read the disk when the error
        is reported

        - "sysctl vm.drop_caches={1,2}" makes the I/O error go away.

Files become unreadable at random, and stay unreadable indefinitely;
however, any time I discover a file that gives EIO on read, I can
poke vm.drop_caches and make the EIO go away.  The file can then be
read normally and has correct contents.  The disk does not seem to be
involved in the I/O error return.
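
A minimal sketch of that check-and-recover cycle (the file path is a
hypothetical example of an affected file):

    FILE=/data/build-3/fs/btrfs/inode.c
    if ! cat "$FILE" > /dev/null; then
        echo "EIO on read, dropping caches" >&2
        sysctl vm.drop_caches=1
        # the same read now succeeds, and the contents compare clean
        # against the pristine source tree
        cat "$FILE" > /dev/null && cmp "$FILE" /data/linux-src/fs/btrfs/inode.c
    fi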

This seems to happen more often when snapshots are being deleted;
however, it also occurs on systems with no snapshots (though those
systems did have snapshots in the past).

When a file returns EIO on read, other snapshots of the same file also
return EIO on read.  I have not been able to test whether this affects
reflink copies (clones) as well.
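
If someone wants to check the clone case, something along these lines
should do it (paths are placeholders, and it assumes the clone operation
itself succeeds):

    # BAD is a hypothetical path to a file currently returning EIO
    BAD=/data/build-3/fs/btrfs/inode.c
    cp --reflink=always "$BAD" /data/clone-test  # uses the clone ioctl, not read()
    cat /data/clone-test > /dev/null   # does the clone also return EIO?
    cat "$BAD" > /dev/null             # and the original, for comparison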

Observed on kernels 3.17 through 3.18.3.  All affected filesystems use
skinny-metadata; I have not seen the problem on any filesystem without
skinny-metadata.
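
Whether a given filesystem has skinny-metadata can be checked roughly like
this (device and UUID are placeholders; how much the flags get decoded
depends on the btrfs-progs version):

    # incompat_flags should include SKINNY_METADATA (bit 0x100)
    btrfs inspect-internal dump-super /dev/sdX | grep -A8 -i incompat_flags

    # on a mounted filesystem, enabled features also appear in sysfs
    ls /sys/fs/btrfs/<filesystem-uuid>/features/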
