Our resharding went directly from unsharded to 1024 shards. One could imagine that an intermediate step would help, but I have no idea whether it would.

Regarding the bluefs size, I am not aware of anything bad. Maybe it was a problem that we had a very small (20 GB) device (an SSD LV) for the DB, which means spillover, but apart from being slow, I think that should work. BTW, I cannot check much anymore; these OSDs have all been removed by now.
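
For anyone who still has such OSDs to look at: in Nautilus a spillover should show up in the health output, and the bluefs perf counters give a rough idea of DB usage. A minimal sketch, with the OSD id only as an example:

    ceph health detail | grep -i spillover      # BLUEFS_SPILLOVER warning, if any
    ceph daemon osd.266 perf dump bluefs        # compare db_used_bytes / db_total_bytes / slow_used_bytes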

Cheers
 Harry

On 17.06.19 11:19, Dan van der Ster wrote:
We have resharded a bucket with 60 million objects from 32 to 64
shards without any problem. (Though there were several slow ops during
the stall after the "counting the objects" phase, so I set nodown as a
precaution.)
We're now resharding that bucket from 64 to 1024.
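
For the record, the commands are roughly (bucket name is a
placeholder; nodown only as a precaution against the stall mentioned
above):

    ceph osd set nodown
    radosgw-admin bucket reshard --bucket=BUCKET --num-shards=1024
    ceph osd unset nodown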

In your case I wonder if it was the large step up to 1024 shards that
caused the crashes somehow? Or maybe your bluefs didn't have enough
free space for the compaction after the large omaps were removed?

-- dan

On Mon, Jun 17, 2019 at 11:14 AM Harald Staub <harald.st...@switch.ch> wrote:

We received the large omap warning before, but for various reasons we could
not react quickly. We accepted the risk of the bucket becoming slow, but
had not thought of further risks ...

On 17.06.19 10:15, Dan van der Ster wrote:
Nice to hear this was resolved in the end.

Coming back to the beginning -- is it clear to anyone what was the
root cause and how other users can avoid this from happening? Maybe
some better default configs to warn users earlier about too-large
omaps?
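
E.g. the deep-scrub thresholds that trigger the large omap warning
could be lowered so the alert fires earlier; roughly something like
this (values purely illustrative, assuming the Nautilus centralized
config):

    ceph config set osd osd_deep_scrub_large_omap_object_key_threshold 200000
    ceph config set osd osd_deep_scrub_large_omap_object_value_size_threshold 1073741824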

Cheers, Dan

On Thu, Jun 13, 2019 at 7:36 PM Harald Staub <harald.st...@switch.ch> wrote:

Looks fine (at least so far), thank you all!

After having exported all 3 copies of the bad PG, we decided to try it
in-place. We also set norebalance to make sure that no data would be
moved. Once the PG was up, the resharding finished with a "success"
message. The large omap warning was gone after deep-scrubbing the PG.
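
In terms of commands, that was roughly (PG id left as a placeholder):

    ceph osd set norebalance
    # bring the 3 OSDs up and let the reshard finish, then:
    ceph pg deep-scrub <pgid>
    ceph osd unset norebalance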

Then we set the 3 OSDs out. Soon after, one after the other went down
(for maybe 2 minutes) and we got degraded PGs, but only once.

Thank you!
    Harry

On 13.06.19 16:14, Sage Weil wrote:
On Thu, 13 Jun 2019, Harald Staub wrote:
On 13.06.19 15:52, Sage Weil wrote:
On Thu, 13 Jun 2019, Harald Staub wrote:
[...]
I think that increasing the various suicide timeout options will allow
it to stay up long enough to clean up the ginormous objects:

     ceph config set osd.NNN osd_op_thread_suicide_timeout 2h

ok

It looks healthy so far:
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-266 fsck
fsck success

Now we have to choose how to continue, trying to reduce the risk of
losing data (most bucket indexes are intact currently). My guess would
be to let this OSD (which was not the primary) go in and hope that it
recovers. In case of a problem, maybe we could still use the other
OSDs "somehow"? In case of success, we would bring back the other OSDs
as well?

OTOH we could try to continue with the key dump from earlier today.

I would start all three osds the same way, with 'noout' set on the
cluster.  You should try to avoid triggering recovery because it will have
a hard time getting through the big index object on that bucket (i.e., it
will take a long time, and might trigger some blocked ios and so forth).

This I do not understand: how would I avoid recovery?

Well, simply doing 'ceph osd set noout' is sufficient to avoid
recovery, I suppose.  But in any case, getting at least 2 of the
existing copies/OSDs online (assuming your pool's min_size=2) will mean
you can finish the reshard process and clean up the big object without
copying the PG anywhere.
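
I.e. something like this (the index pool name is only a guess for a
default RGW setup):

    ceph osd set noout
    ceph osd pool get default.rgw.buckets.index min_size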

I think you may as well do all 3 OSDs this way, then clean up the big
object--that way in the end no data will have to move.

This is Nautilus, right?  If you scrub the PGs in question, that will also
now raise the health alert if there are any remaining big omap objects...
if that warning goes away you'll know you're done cleaning up.  A final
rocksdb compaction should then be enough to remove any remaining weirdness
from rocksdb's internal layout.
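
Concretely, something like (PG and OSD ids are placeholders; the
daemon command has to be run on the OSD's host):

    ceph pg deep-scrub <pgid>
    ceph daemon osd.NNN compact     # trigger an online rocksdb compaction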

(Side note: since you started the OSD read-write using the internal
copy of rocksdb, don't forget that the external copy you extracted
(/mnt/ceph/db?) is now stale!)

As suggested by Paul Emmerich (see the next e-mail in this thread), I
exported this PG. It did not take that long (20 minutes).
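
For reference, such an export is typically done with
ceph-objectstore-tool while the OSD is stopped, roughly like this (PG
id and target path are placeholders):

    systemctl stop ceph-osd@266
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-266 \
        --pgid <pgid> --op export --file /root/pg-export.bin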

Great :)

sage

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com