something weird happened on one of the ceph clusters that i administer
tonight which resulted in virtual machines using rbd volumes seeing
corruption in multiple forms.

when everything was fine earlier in the day, the cluster consisted of a number of
storage nodes spread across 3 different roots in the crush map. the first
bunch of storage nodes have both hard drives and ssds in them with the hard
drives in one root and the ssds in another. there is a pool for each and
the pool for the ssds is a cache tier for the hard drives. the last set of
storage nodes were in a separate root with their own pool that is being
used for burn in testing.

these nodes had run for a while with test traffic and we decided to move
them to the main root and pools. the main cluster is running 0.94.5 and the
new nodes got 0.94.6 due to them getting configured after that was
released. i removed the test pool and did a ceph osd crush move to move the
first node into the main cluster, the hard drives into the root for that
tier of storage and the ssds into the root and pool for the cache tier.
each set was done about 45 minutes apart and they ran for a couple hours
while performing backfill without any issue other than high load on the
cluster.
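for reference, the moves were done with `ceph osd crush move`, roughly like the following sketch (the host bucket and root names here are hypothetical, not the actual ones from our crush map):

```shell
# hypothetical bucket/root names -- substitute your own.
# move the new node's hdd host bucket under the hdd-tier root:
ceph osd crush move newnode-hdd root=hdd
# about 45 minutes later, move its ssd host bucket under the cache-tier root:
ceph osd crush move newnode-ssd root=ssd
```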

we normally run the ssd tier in forward cache-mode because the ssds we
have aren't able to keep up with the io of writeback. this results in io
on the hard drives slowly going up and performance of the cluster starting
to suffer. about once a week, i change the cache-mode between writeback and
forward for short periods of time to promote actively used data to the
cache tier. this moves io load from the hard drive tier to the ssd tier and
has been done multiple times without issue. i normally don't do this while
there are backfills or recoveries happening on the cluster but decided to
go ahead while backfill was happening due to the high load.
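the flip itself is just a cache-mode toggle on the cache pool; a rough sketch, assuming a cache pool named `ssd-cache` (the real pool name in our cluster differs):

```shell
# put the cache tier into writeback for a short window so actively
# used objects get promoted into the ssd tier:
ceph osd tier cache-mode ssd-cache writeback
# ...wait a while for hot data to promote...
# then return to forward so the ssds aren't saturated by writeback io:
ceph osd tier cache-mode ssd-cache forward
```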

i tried this procedure of changing the ssd cache tier between writeback and
forward cache-mode and things seemed okay from the ceph cluster's side. about 10
minutes after the first attempt at changing the mode, vms using the ceph
cluster for their storage started seeing corruption in multiple forms. the
mode was flipped back and forth multiple times in that time frame and it's
unknown if the corruption was noticed with the first change or subsequent
changes. the vms were having issues of filesystems having errors and
getting remounted RO and mysql databases seeing corruption (both myisam and
innodb). some of this was recoverable but on some filesystems there was
corruption that led to things like lots of data ending up in the
lost+found and some of the databases were un-recoverable (backups are
helping there).

i'm not sure what would have happened to cause this corruption. the libvirt
logs for the qemu processes for the vms did not provide any output of
problems from the ceph client code. it doesn't look like any of the qemu
processes had crashed. also, it has now been several hours since this
happened with no additional corruption noticed by the vms. it doesn't
appear that we had any corruption happen before i attempted the flipping of
the ssd tier cache-mode.

the only thing i can think of that is different between this time doing
this procedure vs previous attempts was that there was the one storage node
running 0.94.6 where the remainder were running 0.94.5. is it possible that
something changed between these two releases that would have caused
problems with data consistency related to the cache tier? or otherwise? any
other thoughts or suggestions?
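in case anyone wants to check for the same kind of version mix on their own cluster, the per-daemon versions can be queried directly, e.g.:

```shell
# ask every osd daemon to report its running version; with a mix like
# ours, one host's osds report 0.94.6 while the rest report 0.94.5
ceph tell osd.* version
```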

thanks in advance for any help you can provide.

mike
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
