Can you reproduce this with "debug osd = 20" and "debug ms = 1" set on
the OSD? I think we'll need that data to track down what exactly has
gone wrong here.
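
For reference, a minimal sketch of how those two settings might look in
ceph.conf. Placing them under the [osd] section (so they apply to every
OSD on the host) and restarting the OSD afterwards are assumptions here,
not something spelled out above:

```ini
[osd]
    ; verbose OSD subsystem logging, as requested above
    debug osd = 20
    ; basic messenger logging, as requested above
    debug ms = 1
```

Since the affected OSDs crash and restart, keeping the settings in the
config file means they are still in effect after the next restart.
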
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Mon, Mar 31, 2014 at 1:22 PM, Aaron Ten Clay <[email protected]> wrote:
> Hello fellow Cephers!
>
> Recently, before and after the update from 0.77 to 0.78, about half the OSDs
> in my cluster crash quite frequently with 'osd/PG.cc: 5255: FAILED assert(0
> == "we got a bad state machine event")'
>
> I'm not sure if this is a bug (there are some similar-sounding reports in
> Redmine already), or a configuration/corruption issue on my cluster.
>
> I've got 22 OSDs on 5 hosts, running 0.78 across the board. Any pointers
> would be appreciated! I'd like to track down and resolve the issue; it's
> causing a lot of stalled requests from clients and seems like a
> generally-unhealthy state of being.
>
> Here's a fresh log file (~3MiB) from one OSD that crashed (old log moved
> aside before restarting after crash):
> http://www.aarontc.com/logs/ceph-osd.4.log
>
> Thanks,
> -Aaron
>
>
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
