I also just received my new 480GB SSDs, in case they could be used to move
the PGs to; a rough sketch of what I have in mind is below.  Thank you for
your help.
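
Assuming it is as simple as bringing them up as new BlueStore OSDs and
letting CRUSH backfill onto them, I imagine it would be something like this
(the device path is just an example, and I assume the existing SSD OSDs
would still need to start for backfill to have a source):

    ceph-volume lvm create --bluestore --data /dev/sdX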

On Fri, Jan 26, 2018 at 8:33 AM David Turner <[email protected]> wrote:

> If I could get it started, I could flush-evict the cache, but that doesn't
> seem likely.
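>
> (If it did start, I believe the flush/evict would be something along the
> lines of `rados -p cephfs_cache cache-flush-evict-all`, assuming the cache
> pool is still named cephfs_cache as below.)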
>
> On Fri, Jan 26, 2018 at 8:33 AM David Turner <[email protected]>
> wrote:
>
>> I wouldn't be shocked if they were out of space, but `ceph osd df` only
>> showed them as 45% full when I was first diagnosing this.  Now they are
>> showing completely full with the same command.  I'm thinking the cache tier
>> behavior might have changed in Luminous, because I was keeping my cache
>> completely empty before with a target max objects of 0, which flushed things
>> out consistently once they passed my minimum flush age.  I noticed it wasn't
>> keeping up with the flushing as well as it had in Jewel, but didn't think
>> too much of it.  Anyway, that's something I can tinker with after the pools
>> are back up and running.
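>>
>> For reference, the settings I'm describing were roughly the following (the
>> pool name should be right, but the exact flush age here is just an example):
>>
>>     ceph osd pool set cephfs_cache target_max_objects 0
>>     ceph osd pool set cephfs_cache cache_min_flush_age 600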
>>
>> If they are full and on Bluestore, what can I do to clean them up?  I
>> assume that I need to keep the metadata pool intact, but I don't need to
>> maintain any data in the cache pool.  I have a copy of everything written
>> in the 24 hours prior to this incident, and nothing is modified after it
>> is in CephFS.
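>>
>> What I was imagining, and would love a sanity check on, is taking an
>> offline copy of the metadata PGs before touching anything and then dropping
>> the cache-pool PGs to free up space, roughly along these lines with the
>> OSDs stopped (the PG IDs and file paths are only placeholders):
>>
>>     # list the PGs held on the full OSD
>>     ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-9 --op list-pgs
>>     # export a metadata PG as a safety copy
>>     ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-9 --pgid 20.0 \
>>         --op export --file /root/20.0.export
>>     # remove a cache-pool PG I don't need, to free space
>>     # (newer releases may also want --force here)
>>     ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-9 --pgid 21.0 \
>>         --op remove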
>>
>> On Fri, Jan 26, 2018 at 8:23 AM Nick Fisk <[email protected]> wrote:
>>
>>> I can see this in the logs:
>>>
>>>
>>>
>>> 2018-01-25 06:05:56.292124 7f37fa6ea700 -1 log_channel(cluster) log
>>> [ERR] : full status failsafe engaged, dropping updates, now 101% full
>>>
>>> 2018-01-25 06:05:56.325404 7f3803f9c700 -1
>>> bluestore(/var/lib/ceph/osd/ceph-9) _do_alloc_write failed to reserve 0x4000
>>>
>>> 2018-01-25 06:05:56.325434 7f3803f9c700 -1
>>> bluestore(/var/lib/ceph/osd/ceph-9) _do_write _do_alloc_write failed with
>>> (28) No space left on device
>>>
>>> 2018-01-25 06:05:56.325462 7f3803f9c700 -1
>>> bluestore(/var/lib/ceph/osd/ceph-9) _txc_add_transaction error (28) No
>>> space left on device not handled on operation 10 (op 0, counting from 0)
>>>
>>>
>>>
>>> Are they out of space, or is something mis-reporting?
>>>
>>>
>>>
>>> Nick
>>>
>>>
>>>
>>> *From:* ceph-users [mailto:[email protected]] *On
>>> Behalf Of *David Turner
>>> *Sent:* 26 January 2018 13:03
>>> *To:* ceph-users <[email protected]>
>>> *Subject:* [ceph-users] BlueStore.cc: 9363: FAILED assert(0 ==
>>> "unexpected error")
>>>
>>>
>>>
>>> http://tracker.ceph.com/issues/22796
>>>
>>>
>>>
>>> I was curious if anyone here had any ideas or experience with this
>>> problem.  I created the tracker for this yesterday, after I woke up to find
>>> all 3 of my SSD OSDs not running and unable to start due to this assert
>>> failure.  These OSDs are in my small home cluster and hold the cephfs_cache
>>> and cephfs_metadata pools.
>>>
>>>
>>>
>>> To recap: I upgraded from 10.2.10 to 12.2.2, successfully swapped out my
>>> 9 OSDs to Bluestore, reconfigured my crush rules to utilize OSD classes,
>>> failed to remove the CephFS cache tier due to
>>> http://tracker.ceph.com/issues/22754, created these 3 SSD OSDs, and
>>> updated the cephfs_cache and cephfs_metadata pools to use the
>>> replicated_ssd crush rule.  Fast forward two days of this working great,
>>> and I woke up to all 3 of them crashed and unable to start.  There is an
>>> OSD log with debug bluestore = 5 attached to the tracker linked at the top
>>> of this email.
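>>>
>>> For context, the class-based rule and pool changes were done with roughly
>>> the following (from memory, so the exact invocations may differ slightly):
>>>
>>>     ceph osd crush rule create-replicated replicated_ssd default host ssd
>>>     ceph osd pool set cephfs_metadata crush_rule replicated_ssd
>>>     ceph osd pool set cephfs_cache crush_rule replicated_ssd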
>>>
>>>
>>>
>>> My CephFS is completely down while these 2 pools are inaccessible.  The
>>> OSDs themselves are intact if I need to move the data out manually to the
>>> HDDs or something.  Any help is appreciated.
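>>>
>>> If manually moving the data is the right call, my rough understanding is
>>> that it would be an export from the down SSD OSDs and an import into a
>>> stopped HDD OSD with ceph-objectstore-tool, something like the following
>>> (OSD numbers, PG ID, and file path are only placeholders), but I'd
>>> appreciate confirmation before attempting anything like that:
>>>
>>>     ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-9 --pgid 20.0 \
>>>         --op export --file /root/20.0.export
>>>     ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 \
>>>         --op import --file /root/20.0.export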
>>>
>>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
