Hi

We have a 4-node (physical) cluster running Jewel. Our app talks S3 to the cluster and, no doubt, uses the S3 index heavily. We've had several big outages in the past that seem to be caused by a deep scrub on one of the PGs in the S3 index pool. Generally it starts with a deep scrub on one such PG, followed by lots of slow requests blocking and accumulating, which eventually brings the whole cluster down. In an event like this, we have to set the noup/nodown/noout flags so the OSDs don't suicide during such a deep scrub.
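For reference, the flags we set during such an event look roughly like this (a sketch using the standard Ceph CLI; run against a monitor node):

```shell
# Stop OSDs from being marked up/down/out while the deep-scrub storm lasts
ceph osd set noup
ceph osd set nodown
ceph osd set noout

# ...and clear the flags once the cluster has settled again
ceph osd unset noup
ceph osd unset nodown
ceph osd unset noout
```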

In a recent outage, the deep scrub of one PG took 2 hours to finish. After it finished, I happened to try listing all omap keys of the objects in that PG and found that listing the keys of one particular object caused the same outage described above. That suggested to me that the index object was corrupted, but I can't find anything in the logs. Interestingly (to me), 2 days later that index object seems to have fixed itself: listing its omap keys is quick and easy, and deep-scrubbing the same PG now takes only 3 seconds.

The deep-scrub that took 2 hours to finish:
xxxx.log-20170730.gz:2017-07-29 12:14:10.476325 osd.2 x.x.x.x:6800/78482 217 : cluster [INF] 11.11 deep-scrub starts
xxxx.log-20170730.gz:2017-07-29 14:05:12.108523 osd.2 x.x.x.203:6800/78482 1795 : cluster [INF] 11.11 deep-scrub ok

The command I used to list all omap keys:
rados -p .rgw.buckets.index listomapkeys .dir.c82cdc62-7926-440d-8085-4e7879ef8155.26048.647 | wc -l

Most recent deep-scrub kicked off manually:
2017-07-31 09:54:37.997911 7f78bc333700 0 log_channel(cluster) log [INF] : 11.11 deep-scrub starts
2017-07-31 09:54:40.539494 7f78bc333700 0 log_channel(cluster) log [INF] : 11.11 deep-scrub ok

Setting debug_leveldb to 20/5 didn't log any useful information for the event, sorry, but a perf record shows that most (83%) of the time was spent in LevelDB operations (a screenshot or the perf file can be supplied if anybody is interested, since it exceeds the 150KB size limit).
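For anyone who wants to reproduce the tracing, this is roughly what we did (a sketch; osd.2 is the OSD from the log above, and the pid lookup is illustrative, adjust for how your OSDs are launched):

```shell
# Raise leveldb logging on the affected OSD at runtime
# (didn't yield anything useful in our case)
ceph tell osd.2 injectargs '--debug-leveldb 20/5'

# Profile the OSD process while the slow listomapkeys / deep scrub runs,
# then inspect where the time goes
perf record -g -p "$(pgrep -f 'ceph-osd.*--id 2' | head -1)" -- sleep 60
perf report
```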

I wonder if anybody has come across a similar issue before, or can explain what happened to the index object to make it unusable before but usable 2 days later? One thing that might have fixed the index object is a leveldb compaction, I guess. By the way, the problematic index object above has ~30k keys; the biggest index object in our cluster holds about 300k keys.
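If a leveldb compaction is indeed what fixed it, it can also be triggered manually via the OSD admin socket (a sketch; run on the node hosting the OSD, and I believe this is available on Jewel):

```shell
# Ask osd.2 to compact its omap (leveldb) store now
ceph daemon osd.2 compact

# Alternatively, compact on every OSD start by setting in ceph.conf:
#   [osd]
#   leveldb_compact_on_mount = true
```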

Regards

Stanley

--

Stanley Zhang | Senior Operations Engineer
Telephone: +64 9 302 0515  Fax: +64 9 302 0518
Mobile: +64 22 318 3664  Freephone: 0800 SMX SMX (769 769)
SMX Limited: Level 15, 19 Victoria Street West, Auckland, New Zealand
Web: http://smxemail.com
SMX | Cloud Email Hosting & Security


