I've had two OSDs fail and I'm pretty sure they won't recover from
this. I'm looking for help getting them back online, if that is
possible.

terminate called after throwing an instance of 'ceph::buffer::malformed_input'
  what():  buffer::malformed_input: bad checksum on pg_log_entry_t

- I hit this problem (http://pastebin.com/raw/jBp6YgUp) when starting
my OSD.
- The source code related to this is here:
https://github.com/badone/ceph/blob/master/src/osd/osd_types.cc#L3422-3433
- The OSD logs are here: http://pastebin.com/raw/PWwA0ae6
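
For anyone not familiar with this kind of error, here is a minimal Python
sketch of a checksum check that fails the way the message above describes.
It is not Ceph's code: the framing (length, payload, trailing CRC), the
names encode_entry, decode_entry, flip_bit and MalformedInput are all
hypothetical, and zlib.crc32 is used as a stand-in for the crc32c that, as
far as I can tell from the linked source, Ceph computes over the entry.

```python
import struct
import zlib


class MalformedInput(Exception):
    """Hypothetical stand-in for ceph::buffer::malformed_input."""


def encode_entry(payload: bytes) -> bytes:
    # Assumed framing: 4-byte little-endian length, payload, 4-byte CRC.
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return struct.pack("<I", len(payload)) + payload + struct.pack("<I", crc)


def decode_entry(blob: bytes) -> bytes:
    # Read the length, slice out the payload, then compare checksums.
    (length,) = struct.unpack_from("<I", blob, 0)
    payload = blob[4:4 + length]
    (stored_crc,) = struct.unpack_from("<I", blob, 4 + length)
    if zlib.crc32(payload) & 0xFFFFFFFF != stored_crc:
        raise MalformedInput("bad checksum on entry")
    return payload


def flip_bit(blob: bytes, byte_index: int, bit: int) -> bytes:
    # Simulate on-disk or in-memory corruption of a single bit.
    corrupted = bytearray(blob)
    corrupted[byte_index] ^= 1 << bit
    return bytes(corrupted)
```

An intact entry decodes normally, while flipping even one bit in the
payload makes decode_entry raise, which is roughly what I think the OSD is
doing when it aborts on startup.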

It seems that my OSDs were corrupted (I don't know why), yet there is no
trace of problems in dmesg, SMART, or anything that xfs_repair could find.

These two OSDs make up 6 TB of my 40 TB array (triple replicated) and I'm
pretty sure I can't recover from this. I will probably know in about
10 hours. Does anyone know of anything I can try to repair my OSDs?

My notes on the situation:

- The OSD can't find the superblock on first start after a reboot, no idea
why. It's there, I can see it, and the OSD doesn't complain after that.
- The two OSD disks were bought at the same time and have similar serials,
but there are no bad SMART stats or dmesg errors relating to them.
- The host these were installed in had a funky BIOS that was only reporting
half the RAM it had in it. It doesn't have ECC memory. I have since
replaced the memory.
- xfs_repair has been run on both OSDs; it found nothing, and the problem
still persists.
- I have been at HEALTH_OK every day, but overnight scrubbing has been
uncovering problematic PGs that I've had to repair, every single night so
far. This morning was when it went beyond my ability to repair.
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
