there are not any monitors running on the new nodes. the monitors are on separate nodes and running the 0.94.5 release. i spent some time thinking about this last night as well, and my thoughts went to the recency patches. i wouldn't think that caused this, but it's the only thing that seems close.
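for anyone trying to reproduce this, the per-daemon versions and the recency setting in question can be checked along these lines (the pool name is a made-up example, not the one from this cluster):

```shell
# confirm which release each daemon is actually running,
# to spot a 0.94.5 / 0.94.6 mix like the one described here
ceph tell osd.* version
ceph tell mon.* version

# check the read-recency promotion setting on the cache pool
# ("ssd-cache" is a hypothetical pool name)
ceph osd pool get ssd-cache min_read_recency_for_promote
```

these commands need a live cluster and admin keyring, so the output will obviously differ per deployment.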
mike

On Mon, Mar 14, 2016 at 9:35 PM, Christian Balzer <[email protected]> wrote:
>
> Hello,
>
> On Mon, 14 Mar 2016 20:51:04 -0600 Mike Lovell wrote:
>
> > something weird happened on one of the ceph clusters that i administer
> > tonight which resulted in virtual machines using rbd volumes seeing
> > corruption in multiple forms.
> >
> > when everything was fine earlier in the day, the cluster was a number of
> > storage nodes spread across 3 different roots in the crush map. the first
> > bunch of storage nodes have both hard drives and ssds in them, with the
> > hard drives in one root and the ssds in another. there is a pool for
> > each, and the pool for the ssds is a cache tier for the hard drives. the
> > last set of storage nodes were in a separate root with their own pool
> > that is being used for burn-in testing.
> >
> > these nodes had run for a while with test traffic and we decided to move
> > them to the main root and pools. the main cluster is running 0.94.5 and
> > the new nodes got 0.94.6 due to them getting configured after that was
> > released. i removed the test pool and did a ceph osd crush move to move
> > the first node into the main cluster: the hard drives into the root for
> > that tier of storage and the ssds into the root and pool for the cache
> > tier. each set was done about 45 minutes apart, and they ran for a couple
> > of hours while performing backfill without any issue other than high load
> > on the cluster.
> >
> Since I gleaned what your setup looks like from Robert's posts and yours, I
> won't say the obvious thing, as you aren't using EC pools.
>
> > we normally run the ssd tier in the forward cache-mode due to the ssds we
> > have not being able to keep up with the io of writeback. this results in
> > io on the hard drives slowly going up and performance of the cluster
> > starting to suffer.
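[inline note for anyone following the thread: the crush moves described above look roughly like the following; the host and root names are hypothetical examples, not the actual ones from this cluster]

```shell
# move a new node's hdd and ssd crush buckets out of the burn-in
# root and into the production roots ("node13-hdd", "node13-ssd",
# "hdd", and "ssd-cache" are all made-up example names)
ceph osd crush move node13-hdd root=hdd
ceph osd crush move node13-ssd root=ssd-cache
```

each move kicks off backfill as PGs remap onto the newly placed buckets, which matches the high load reported above.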
> > about once a week, i change the cache-mode between
> > writeback and forward for short periods of time to promote actively used
> > data to the cache tier. this moves io load from the hard drive tier to
> > the ssd tier and has been done multiple times without issue. i normally
> > don't do this while there are backfills or recoveries happening on the
> > cluster but decided to go ahead while backfill was happening due to the
> > high load.
> >
> As you might recall, I managed to have "rados bench" break (I/O error) when
> doing these switches with Firefly on my crappy test cluster, but not with
> Hammer.
> However, I haven't done any such switches on my production cluster with a
> cache tier, both because the cache pool hasn't even reached 50% capacity
> after 2 weeks of pounding and because I'm sure that everything will hold
> up when it comes to the first flushing.
>
> Maybe the extreme load (as opposed to normal VM ops) of your cluster
> during the backfilling triggered the same or a similar bug.
>
> > i tried this procedure to change the ssd cache-tier between writeback and
> > forward cache-mode, and things seemed okay from the ceph cluster. about 10
> > minutes after the first attempt at changing the mode, vms using the ceph
> > cluster for their storage started seeing corruption in multiple forms.
> > the mode was flipped back and forth multiple times in that time frame,
> > and it's unknown if the corruption was noticed with the first change or
> > subsequent changes. the vms were having issues with filesystems having
> > errors and getting remounted read-only and mysql databases seeing
> > corruption (both myisam and innodb). some of this was recoverable, but on
> > some filesystems there was corruption that led to things like lots of
> > data ending up in lost+found, and some of the databases were
> > unrecoverable (backups are helping there).
> >
> > i'm not sure what would have happened to cause this corruption.
> > the libvirt logs for the qemu processes for the vms did not provide any
> > output of problems from the ceph client code. it doesn't look like any
> > of the qemu processes had crashed. also, it has now been several hours
> > since this happened with no additional corruption noticed by the vms. it
> > doesn't appear that we had any corruption happen before i attempted the
> > flipping of the ssd tier cache-mode.
> >
> > the only thing i can think of that is different between this time doing
> > this procedure vs previous attempts was that there was the one storage
> > node running 0.94.6 where the remainder were running 0.94.5. is it
> > possible that something changed between these two releases that would
> > have caused problems with data consistency related to the cache tier? or
> > otherwise? any other thoughts or suggestions?
> >
> What comes to mind in terms of these 2 versions is that .6 has working
> read recency, supposedly.
> Which (as well as Infernalis) exposed the bug(s) when running with EC
> backing pools.
>
> Some cache pool members acting upon the recency and others not might
> confuse things, but you'd think that this is a per-OSD (PG) thing and
> objects not promoted would be acted upon accordingly.
>
> Those new nodes had no monitors on them, right?
>
> Christian
>
> > thanks in advance for any help you can provide.
> >
> > mike
>
> --
> Christian Balzer        Network/Systems Engineer
> [email protected]       Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
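[for context, the writeback/forward flip discussed throughout the thread is done with commands along these lines; "ssd-cache" is a hypothetical cache pool name, not the one from this cluster]

```shell
# temporarily put the cache tier in writeback so actively used
# objects get promoted to the ssd pool
ceph osd tier cache-mode ssd-cache writeback

# ... wait a short period while hot data is promoted ...

# then return to forward mode so new writes bypass the cache
ceph osd tier cache-mode ssd-cache forward
```

these are cluster-state-changing commands, so they obviously should not be run outside a maintenance context like the one described.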
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
