I wouldn't be shocked if they were out of space, but `ceph osd df` only
showed them as 45% full when I was first diagnosing this.  Now the same
command shows them as completely full.  I'm wondering if the cache tier
behavior changed in Luminous: I had been keeping my cache essentially
empty with target_max_objects set to 0, which consistently flushed
things out once they passed my minimum flush age.  I noticed it wasn't
keeping up with the flushing as well as it had in Jewel, but didn't
think too much of it.  Anyway, that's something I can tinker with after
the pools are back up and running.
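
For reference, the tiering knobs I mean are the normal cache pool
settings (the values below are placeholders to show the shape of the
config, not my exact numbers):

  ceph osd pool set cephfs_cache target_max_objects 0
  ceph osd pool set cephfs_cache cache_min_flush_age 60
  ceph osd pool set cephfs_cache cache_min_evict_age 60
  ceph osd df    # the per-OSD utilization I was watching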

If they are genuinely full and on Bluestore, what can I do to clean them
up?  I assume I need to keep the metadata pool intact, but I don't need
to preserve any data in the cache pool.  I have a copy of everything
written in the 24 hours prior to this incident, and nothing is modified
once it is in cephfs.
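
The rough plan I have in mind once the OSDs will start again (untested,
and the 0.97 is just an example bump over the usual 0.95 default) is to
give them a little headroom and then drain the cache pool:

  ceph osd set-full-ratio 0.97                  # temporary headroom, example value
  rados -p cephfs_cache cache-flush-evict-all   # flush dirty objects and evict everything from the cache tier
  ceph osd set-full-ratio 0.95                  # restore the default afterwards

That is just a sketch, though; if there is a better way to empty a cache
pool sitting on full Bluestore OSDs, I'm all ears.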

On Fri, Jan 26, 2018 at 8:23 AM Nick Fisk <[email protected]> wrote:

> I can see this in the logs:
>
>
>
> 2018-01-25 06:05:56.292124 7f37fa6ea700 -1 log_channel(cluster) log [ERR]
> : full status failsafe engaged, dropping updates, now 101% full
>
> 2018-01-25 06:05:56.325404 7f3803f9c700 -1
> bluestore(/var/lib/ceph/osd/ceph-9) _do_alloc_write failed to reserve 0x4000
>
> 2018-01-25 06:05:56.325434 7f3803f9c700 -1
> bluestore(/var/lib/ceph/osd/ceph-9) _do_write _do_alloc_write failed with
> (28) No space left on device
>
> 2018-01-25 06:05:56.325462 7f3803f9c700 -1
> bluestore(/var/lib/ceph/osd/ceph-9) _txc_add_transaction error (28) No
> space left on device not handled on operation 10 (op 0, counting from 0)
>
>
>
> Are they out of space, or is something mis-reporting?
>
>
>
> Nick
>
>
>
> From: ceph-users [mailto:[email protected]] On Behalf Of David Turner
> Sent: 26 January 2018 13:03
> To: ceph-users <[email protected]>
> Subject: [ceph-users] BlueStore.cc: 9363: FAILED assert(0 == "unexpected error")
>
>
>
> http://tracker.ceph.com/issues/22796
>
>
>
> I was curious if anyone here had any ideas or experience with this
> problem.  I created the tracker for this yesterday when I woke up to find
> all 3 of my SSD OSDs not running and unable to start due to this segfault.
> These OSDs are in my small home cluster and hold the cephfs_cache and
> cephfs_metadata pools.
>
>
>
> To recap, I upgraded from 10.2.10 to 12.2.2, successfully swapped out my 9
> OSDs to Bluestore, reconfigured my crush rules to utilize OSD classes,
> failed to remove the CephFS cache tier due to
> http://tracker.ceph.com/issues/22754, created these 3 SSD OSDs and
> updated the cephfs_cache and cephfs_metadata pools to use the
> replicated_ssd crush rule... fast forward 2 days of this working great to
> me waking up with all 3 of them crashed and unable to start.  There is an
> OSD log with debug bluestore = 5 attached to the tracker at the top of the
> email.
>
>
>
> My CephFS is completely down while these 2 pools are inaccessible.  The
> OSDs themselves are intact if I need to move the data out manually to the
> HDDs or something.  Any help is appreciated.
>
