there are not any monitors running on the new nodes. the monitors are on
separate nodes and running the 0.94.5 release. i spent some time thinking
about this last night as well, and my thoughts went to the recency patches.
i wouldn't think that caused this, but it's the only thing that seems close.
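if it helps, the recency behavior is driven by a few settings on the cache
pool, which can be checked like this (pool name 'ssd-cache' here is just a
placeholder for ours):

```shell
# hit-set tracking and promotion recency settings on the cache pool
ceph osd pool get ssd-cache hit_set_count
ceph osd pool get ssd-cache hit_set_period
ceph osd pool get ssd-cache min_read_recency_for_promote
```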

mike

On Mon, Mar 14, 2016 at 9:35 PM, Christian Balzer <[email protected]> wrote:

>
> Hello,
>
> On Mon, 14 Mar 2016 20:51:04 -0600 Mike Lovell wrote:
>
> > something weird happened on one of the ceph clusters that i administer
> > tonight which resulted in virtual machines using rbd volumes seeing
> > corruption in multiple forms.
> >
> > when everything was fine earlier in the day, the cluster was a number of
> > storage nodes spread across 3 different roots in the crush map. the first
> > bunch of storage nodes have both hard drives and ssds in them with the
> > hard drives in one root and the ssds in another. there is a pool for
> > each and the pool for the ssds is a cache tier for the hard drives. the
> > last set of storage nodes were in a separate root with their own pool
> > that is being used for burn in testing.
> >
> > these nodes had run for a while with test traffic and we decided to move
> > them to the main root and pools. the main cluster is running 0.94.5 and
> > the new nodes got 0.94.6 due to them getting configured after that was
> > released. i removed the test pool and did a ceph osd crush move to move
> > the first node into the main cluster, the hard drives into the root for
> > that tier of storage and the ssds into the root and pool for the cache
> > tier. each set was done about 45 minutes apart and they ran for a couple
> > hours while performing backfill without any issue other than high load
> > on the cluster.
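for reference, the moves were done with 'ceph osd crush move'; roughly like
this, with placeholder host bucket and root names:

```shell
# move a new node's hdd host bucket into the hdd root
ceph osd crush move node13-hdd root=hdd
# and its ssd host bucket into the root backing the cache tier
ceph osd crush move node13-ssd root=ssd
```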
> >
> Since I gleaned what your setup looks like from Robert's posts and yours, I
> won't say the obvious thing, as you aren't using EC pools.
>
> > we normally run the ssd tier in the forward cache-mode due to the ssds we
> > have not being able to keep up with the io of writeback. this results in
> > io on the hard drives slowly going up and performance of the cluster
> > starting to suffer. about once a week, i change the cache-mode between
> > writeback and forward for short periods of time to promote actively used
> > data to the cache tier. this moves io load from the hard drive tier to
> > the ssd tier and has been done multiple times without issue. i normally
> > don't do this while there are backfills or recoveries happening on the
> > cluster but decided to go ahead while backfill was happening due to the
> > high load.
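for what it's worth, the flip itself is just the tier cache-mode command run
against the cache pool; something like (our pool name may differ):

```shell
# promote hot objects: run the cache tier in writeback for a while
ceph osd tier cache-mode ssd-cache writeback
# then drop back to forward once io has shifted to the ssds
ceph osd tier cache-mode ssd-cache forward
```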
> >
> As you might recall, I managed to have "rados bench" break (I/O error) when
> doing these switches with Firefly on my crappy test cluster, but not with
> Hammer.
> However I haven't done any such switches on my production cluster with a
> cache tier, both because the cache pool hasn't even reached 50% capacity
> after 2 weeks of pounding and because I'm sure that everything will hold
> up when it comes to the first flushing.
>
> Maybe the extreme load (as opposed to normal VM ops) of your cluster
> during the backfilling triggered the same or a similar bug.
>
> > i tried this procedure to change the ssd cache-tier between writeback and
> > forward cache-mode and things seemed okay from the ceph cluster. about 10
> > minutes after the first attempt at changing the mode, vms using the ceph
> > cluster for their storage started seeing corruption in multiple forms.
> > the mode was flipped back and forth multiple times in that time frame
> > and it's unknown if the corruption was noticed with the first change or
> > subsequent changes. the vms were having issues of filesystems having
> > errors and getting remounted RO and mysql databases seeing corruption
> > (both myisam and innodb). some of this was recoverable but on some
> > filesystems there was corruption that led to things like lots of data
> > ending up in the lost+found and some of the databases were
> > un-recoverable (backups are helping there).
> >
> > i'm not sure what would have happened to cause this corruption. the
> > libvirt logs for the qemu processes for the vms did not provide any
> > output of problems from the ceph client code. it doesn't look like any
> > of the qemu processes had crashed. also, it has now been several hours
> > since this happened with no additional corruption noticed by the vms. it
> > doesn't appear that we had any corruption happen before i attempted the
> > flipping of the ssd tier cache-mode.
> >
> > the only thing i can think of that is different between this time doing
> > this procedure vs previous attempts was that there was the one storage
> > node running 0.94.6 where the remainder were running 0.94.5. is it
> > possible that something changed between these two releases that would
> > have caused problems with data consistency related to the cache tier? or
> > otherwise? any other thoughts or suggestions?
> >
> What comes to mind in terms of these 2 versions is that .6 has working
> read recency, supposedly.
> Which (as well as Infernalis) exposed the bug(s) when running with EC
> backing pools.
>
> Some cache pool members acting upon the recency and others not might
> confuse things, but you'd think that this is a per-OSD (PG) thing, with
> objects that aren't promoted being acted upon accordingly.
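for completeness, the version skew across the osds can be confirmed with:

```shell
# report the ceph version each osd daemon is running
ceph tell osd.* version
```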
>
> Those new nodes had no monitors on them, right?
>
> Christian
> > thanks in advance for any help you can provide.
> >
> > mike
>
>
> --
> Christian Balzer        Network/Systems Engineer
> [email protected]           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
