Hi again,
I'm starting to feel really unlucky here...
At the moment, the situation is "sort of okay":
1387 active+clean
11 active+clean+inconsistent
7 active+recovery_wait+degraded
1 active+recovery_wait+undersized+degraded+remapped
1 active+undersized+degraded+remapped+wait_backfill
1 active+undersized+degraded+remapped+inconsistent+backfilling
To ensure nothing is in the way, I disabled both scrubbing and deep
scrubbing for the time being.
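(For anyone who wants to do the same: the cluster-wide flags are the usual
way to do this, e.g.:

  ceph osd set noscrub
  ceph osd set nodeep-scrub

They can be reverted later with "ceph osd unset noscrub" and
"ceph osd unset nodeep-scrub".)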
However, random OSDs (still on Hammer) keep crashing with the error I
mentioned earlier (osd/ReplicatedPG.cc: 10115: FAILED assert(r >= 0)).
It looked like they started crashing when hitting the PG that is currently
backfilling, so I set the nobackfill flag.
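(That is:

  ceph osd set nobackfill

to be unset again with "ceph osd unset nobackfill" once things look stable.)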
For now, the crashing seems to have stopped. However, the cluster is quite
slow when accessing the affected PG via KVM/QEMU (RBD).
Recap:
* All monitors run Infernalis.
* One OSD node runs Infernalis.
* All other OSD nodes run Hammer.
* One OSD on Infernalis is set to "out" and is stopped (see the commands
  after this list). This OSD seemed to contain one inconsistent PG.
* Backfilling started.
* After hours and hours of backfilling, OSDs started to crash.
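(For completeness, taking an OSD out and stopping it boils down to
something like:

  ceph osd out <osd-id>
  systemctl stop ceph-osd@<osd-id>

where <osd-id> is just a placeholder. The second command applies to the
systemd-based Infernalis node; the older init scripts use
"service ceph stop osd.<osd-id>" instead.)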
Other than restarting the "out" and stopped OSD (which I haven't tried
yet), I'm quite lost.
Hopefully someone has some pointers for me.
Regards,
Kees
On 20-08-18 13:23, Kees Meijs wrote:
The given PG is back online, phew...
Meanwhile, some OSDs still on Hammer seem to crash with errors like:
2018-08-20 13:06:33.819569 7f8962b2f700 -1 osd/ReplicatedPG.cc: In
function 'void ReplicatedPG::scan_range(int, int,
PG::BackfillInterval*, ThreadPool::TPHandle&)' thread 7f8962b2f700
time 2018-08-20 13:06:33.709922
osd/ReplicatedPG.cc: 10115: FAILED assert(r >= 0)
Restarting the OSDs seems to work.
K.
On 20-08-18 13:14, Kees Meijs wrote:
Bad news: I've got a PG stuck in down+peering now.