Hi Alexandre,
Thanks for your suggestion. I also considered using errors=continue,
along the same lines as the nobarrier idea, but I was afraid it might
lead to silent corruption on errors not caused by slow requests on
OSDs. I was hoping for a solution that would specifically tolerate
slowness of the block device while still reacting to other errors, but
I may have to accept the lesser evil or keep investigating.
Cheers,
Paulo
On Mon, 2014-11-24 at 20:14 +0100, Alexandre DERUMIER wrote:
> Hi,
> try to mount your filesystems with errors=continue option
>
>
> From the mount (8) man page
>
> errors={continue|remount-ro|panic}
> Define the behaviour when an error is encountered. (Either ignore errors
> and just mark the filesystem erroneous and continue, or remount the
> filesystem read-only, or panic and halt the system.) The default is set in
> the filesystem superblock, and can be changed using tune2fs(8).
>
>
>
>
> ----- Original Message -----
>
> From: "Paulo Almeida" <[email protected]>
> To: [email protected]
> Sent: Monday, 24 November 2014 17:06:40
> Subject: [ceph-users] Virtual machines using RBD remount read-only on OSD slow
> requests
>
> Hi,
>
> I have a Ceph cluster with 4 disk servers, 14 OSDs and replica size of
> 3. A number of KVM virtual machines are using RBD as their only storage
> device. Whenever some OSDs (always on a single server) have slow
> requests, caused, I believe, by flaky hardware or, on one occasion, by
> an S.M.A.R.T. command that crashed the system disk of one of the disk
> servers, most virtual machines remount their disks read-only and need
> to be rebooted.
>
> One of the virtual machines still has Debian 6 installed, and it never
> crashes. It also has an ext3 filesystem, unlike some other machines,
> which have ext4. ext3 does crash on systems with Debian 7, but those
> use different mount flags, such as "barrier" and "data=ordered". I
> suspect (but haven't tested) that using "nobarrier" may solve the
> problem, but that doesn't seem to be an ideal solution.
>
> Most of those machines have Debian 7 or Ubuntu 12.04, but two of them
> have Ubuntu 14.04 (and thus a more recent kernel) and they also remount
> read-only.
>
> I searched the mailing list and found a couple of relevant messages. One
> person seemed to have the same problem[1], but someone else replied that
> it didn't happen in his case ("I've had multiple VMs hang for hours at a
> time when I broke a Ceph cluster and after fixing it the VMs would start
> working again"). The other message[2] is not very informative.
>
> Are other people experiencing this problem? Is there a file system or
> kernel version that is recommended for KVM guests that would prevent it?
> Or does this problem indicate that something else is wrong and should be
> fixed? I did configure all machines to use "cache=writeback", but never
> investigated whether that makes a difference or even whether it is
> actually working.
>
> Thanks,
> Paulo Almeida
> Instituto Gulbenkian de Ciência, Oeiras, Portugal
>
>
> [1] http://thread.gmane.org/gmane.comp.file-systems.ceph.user/8011
> [2] http://thread.gmane.org/gmane.comp.file-systems.ceph.user/1742
>
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com