Not exactly; this feature was supported in Jewel starting with 10.2.11, ref 
https://github.com/ceph/ceph/pull/18010

I thought you mentioned you were using Luminous 12.2.4.

From: David Turner <drakonst...@gmail.com>
Date: Friday, November 2, 2018 at 5:21 PM
To: Pavan Rallabhandi <prallabha...@walmartlabs.com>
Cc: ceph-users <ceph-users@lists.ceph.com>
Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster 
unusable and takes forever

That makes so much more sense. It seems like RHCS has had this ability since 
Jewel, while it was only put into the community version as of Mimic, so my 
version of Ceph isn't actually capable of changing the backend db. While 
digging into the code I did find a bug with how the rocksdb backend is created 
with ceph-kvstore-tool: it doesn't use the Ceph defaults or any settings in 
your config file for the db settings. I'm working on testing a modified version 
that should take those settings into account. If the fix works, it should also 
apply to a few other tools that can be used to set up the omap backend db.
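
For reference, checking what the OSD itself expects for these settings is straightforward (just a sketch, assuming osd.0 and the default admin socket):

    ceph daemon osd.0 config get filestore_omap_backend
    ceph daemon osd.0 config get filestore_rocksdb_options

The db created by ceph-kvstore-tool currently ignores whatever these return, which is the bug I'm testing a fix for.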

On Fri, Nov 2, 2018, 4:26 PM Pavan Rallabhandi 
<prallabha...@walmartlabs.com<mailto:prallabha...@walmartlabs.com>> wrote:
It was the Red Hat version of Jewel. But perhaps more relevantly, we are on 
Ubuntu, unlike your case.

From: David Turner <drakonst...@gmail.com<mailto:drakonst...@gmail.com>>
Date: Friday, November 2, 2018 at 10:24 AM

To: Pavan Rallabhandi 
<prallabha...@walmartlabs.com<mailto:prallabha...@walmartlabs.com>>
Cc: ceph-users <ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>>
Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster 
unusable and takes forever

Pavan, which version of Ceph were you using when you changed your backend to 
rocksdb?

On Mon, Oct 1, 2018 at 4:24 PM Pavan Rallabhandi 
<prallabha...@walmartlabs.com<mailto:prallabha...@walmartlabs.com>> wrote:
Yeah, I think this has something to do with the CentOS binaries; sorry that I 
couldn't be of much help here.

Thanks,
-Pavan.

From: David Turner <drakonst...@gmail.com<mailto:drakonst...@gmail.com>>
Date: Monday, October 1, 2018 at 1:37 PM
To: Pavan Rallabhandi 
<prallabha...@walmartlabs.com<mailto:prallabha...@walmartlabs.com>>
Cc: ceph-users <ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>>
Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the cluster 
unusable and takes forever

I tried modifying filestore_rocksdb_options by removing 
compression=kNoCompression as well as setting it to 
compression=kSnappyCompression.  Leaving it with kNoCompression or removing it 
results in the same segfault as in the previous log.  Setting it to 
kSnappyCompression resulted in [1] this being logged and the OSD failing to 
start instead of segfaulting.  Is there anything else you would suggest 
trying before I purge this OSD from the cluster?  I'm afraid it might be 
something with the CentOS binaries.

[1] 2018-10-01 17:10:37.134930 7f1415dfcd80  0  set rocksdb option compression 
= kSnappyCompression
2018-10-01 17:10:37.134986 7f1415dfcd80 -1 rocksdb: Invalid argument: 
Compression type Snappy is not linked with the binary.
2018-10-01 17:10:37.135004 7f1415dfcd80 -1 filestore(/var/lib/ceph/osd/ceph-1) 
mount(1723): Error initializing rocksdb :
2018-10-01 17:10:37.135020 7f1415dfcd80 -1 osd.1 0 OSD:init: unable to mount 
object store
2018-10-01 17:10:37.135029 7f1415dfcd80 -1 ESC[0;31m ** ERROR: osd init failed: 
(1) Operation not permittedESC[0m
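
For completeness, the two variants were tried one at a time in ceph.conf, roughly like this (placement under [global] assumed; the base option string is the default quoted further down in this thread):

    [global]
    # variant 1: drop the compression token entirely (still segfaults)
    filestore_rocksdb_options = "max_background_compactions=8,compaction_readahead_size=2097152"
    # variant 2: explicitly request snappy (OSD refuses to start, see [1])
    filestore_rocksdb_options = "max_background_compactions=8,compaction_readahead_size=2097152,compression=kSnappyCompression"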

On Sat, Sep 29, 2018 at 1:57 PM Pavan Rallabhandi 
<prallabha...@walmartlabs.com<mailto:prallabha...@walmartlabs.com>> 
wrote:
I looked at one of my test clusters running Jewel on Ubuntu 16.04, and 
interestingly I found this (below) in one of the OSD logs, which is different 
from your OSD boot log, where none of the compression algorithms seem to be 
supported. This hints more at how rocksdb was built on CentOS for Ceph.

2018-09-29 17:38:38.629112 7fbd318d4b00  4 rocksdb: Compression algorithms 
supported:
2018-09-29 17:38:38.629112 7fbd318d4b00  4 rocksdb:     Snappy supported: 1
2018-09-29 17:38:38.629113 7fbd318d4b00  4 rocksdb:     Zlib supported: 1
2018-09-29 17:38:38.629113 7fbd318d4b00  4 rocksdb:     Bzip supported: 0
2018-09-29 17:38:38.629114 7fbd318d4b00  4 rocksdb:     LZ4 supported: 0
2018-09-29 17:38:38.629114 7fbd318d4b00  4 rocksdb:     ZSTD supported: 0
2018-09-29 17:38:38.629115 7fbd318d4b00  4 rocksdb: Fast CRC32 supported: 0

On 9/27/18, 2:56 PM, "Pavan Rallabhandi" 
<prallabha...@walmartlabs.com<mailto:prallabha...@walmartlabs.com>> 
wrote:

    I see Filestore symbols on the stack, so the bluestore config doesn't come 
into play. And the top frame of the stack hints at a RocksDB issue; there are 
a whole lot of these too:

    “2018-09-17 19:23:06.480258 7f1f3d2a7700  2 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.4/rpm/el7/BUILD/ceph-12.2.4/src/rocksdb/table/block_based_table_reader.cc:636]
 Cannot find Properties block from file.”

    It really seems to be something with RocksDB on CentOS. I still think you 
can try removing “compression=kNoCompression” from the 
filestore_rocksdb_options, and/or check whether rocksdb is expecting snappy to 
be enabled.

    Thanks,
    -Pavan.

    From: David Turner 
<drakonst...@gmail.com<mailto:drakonst...@gmail.com>>
    Date: Thursday, September 27, 2018 at 1:18 PM
    To: Pavan Rallabhandi 
<prallabha...@walmartlabs.com<mailto:prallabha...@walmartlabs.com>>
    Cc: ceph-users 
<ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>>
    Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the 
cluster unusable and takes forever

    I got pulled away from this for a while.  The error in the log is "abort: 
Corruption: Snappy not supported or corrupted Snappy compressed block contents" 
and the OSD has two settings set to snappy by default, async_compressor_type and 
bluestore_compression_algorithm.  Does either of these settings affect the omap 
store?
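
    In case it's relevant, this is roughly how I'm reading those two values off the OSD (just an example, assuming osd.0 and the default admin socket):

        ceph daemon osd.0 config get async_compressor_type
        ceph daemon osd.0 config get bluestore_compression_algorithm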

    On Wed, Sep 19, 2018 at 2:33 PM Pavan Rallabhandi 
<prallabha...@walmartlabs.com<mailto:prallabha...@walmartlabs.com>>
 wrote:
    Looks like you are running on CentOS, fwiw. We've successfully run the 
conversion commands on Jewel on Ubuntu 16.04.

    I have a feeling it's expecting compression to be enabled; can you try 
removing “compression=kNoCompression” from the filestore_rocksdb_options? 
And/or you might want to check whether rocksdb is expecting snappy to be enabled.

    From: David Turner 
<drakonst...@gmail.com<mailto:drakonst...@gmail.com>>
    Date: Tuesday, September 18, 2018 at 6:01 PM
    To: Pavan Rallabhandi 
<prallabha...@walmartlabs.com<mailto:prallabha...@walmartlabs.com>>
    Cc: ceph-users 
<ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>>
    Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the 
cluster unusable and takes forever

    Here's the [1] full log from the time the OSD was started to the end of the 
crash dump.  These logs are so hard to parse.  Is there anything useful in them?

    I did confirm that all perms were set correctly and that the superblock was 
changed to rocksdb before the first time I attempted to start the OSD with its 
new DB (quick checks sketched below).  This is on a fully Luminous cluster with 
[2] the defaults you mentioned.

    [1] https://gist.github.com/drakonstein/fa3ac0ad9b2ec1389c957f95e05b79ed
    [2] "filestore_omap_backend": "rocksdb",
    "filestore_rocksdb_options": 
"max_background_compactions=8,compaction_readahead_size=2097152,compression=kNoCompression",

    On Tue, Sep 18, 2018 at 5:29 PM Pavan Rallabhandi 
<prallabha...@walmartlabs.com<mailto:prallabha...@walmartlabs.com>>
 wrote:
    I meant that the stack trace hints the superblock still has leveldb in it; 
have you verified that already?

    On 9/18/18, 5:27 PM, "Pavan Rallabhandi" 
<prallabha...@walmartlabs.com<mailto:prallabha...@walmartlabs.com>>
 wrote:

        You should be able to set them under the global section. And that 
reminds me: since you are on Luminous already, I'd guess those values are 
already the defaults; you can verify from the admin socket of any OSD.
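
        Something along these lines should show them (just an example, assuming osd.0):

            ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config get filestore_omap_backend
            ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config get filestore_rocksdb_options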

        But didn't the stack trace hint that the superblock on the OSD is 
still considering the omap backend to be leveldb, and that this has to do with 
the compression?

        Thanks,
        -Pavan.

        From: David Turner 
<drakonst...@gmail.com<mailto:drakonst...@gmail.com>>
        Date: Tuesday, September 18, 2018 at 5:07 PM
        To: Pavan Rallabhandi 
<prallabha...@walmartlabs.com<mailto:prallabha...@walmartlabs.com>>
        Cc: ceph-users 
<ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>>
        Subject: EXT: Re: [ceph-users] Any backfill in our cluster makes the 
cluster unusable and takes forever

        Are those settings fine to have as global settings even if not all OSDs 
on a node have rocksdb as the backend?  Or will I need to convert all OSDs on a 
node at the same time?

        On Tue, Sep 18, 2018 at 5:02 PM Pavan Rallabhandi 
<prallabha...@walmartlabs.com<mailto:prallabha...@walmartlabs.com>>
 wrote:
        The steps that were outlined for conversion are correct; have you tried 
setting some of the relevant ceph conf values too:

        filestore_rocksdb_options = 
"max_background_compactions=8;compaction_readahead_size=2097152;compression=kNoCompression"

        filestore_omap_backend = rocksdb
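
        In ceph.conf that would look something like this (e.g. under [global]; values exactly as above):

            [global]
            filestore_omap_backend = rocksdb
            filestore_rocksdb_options = "max_background_compactions=8;compaction_readahead_size=2097152;compression=kNoCompression"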

        Thanks,
        -Pavan.

        From: ceph-users 
<ceph-users-boun...@lists.ceph.com<mailto:ceph-users-boun...@lists.ceph.com>>
 on behalf of David Turner 
<drakonst...@gmail.com<mailto:drakonst...@gmail.com>>
        Date: Tuesday, September 18, 2018 at 4:09 PM
        To: ceph-users 
<ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>>
        Subject: EXT: [ceph-users] Any backfill in our cluster makes the 
cluster unusable and takes forever

        I've finally learned enough about the OSD backend to track down this 
issue to what I believe is the root cause.  LevelDB compaction is the common 
thread every time we move data around our cluster.  I've ruled out PG subfolder 
splitting, EC doesn't seem to be the root cause of this, and it is cluster-wide 
as opposed to specific hardware.

        One of the first things I found after digging into leveldb omap 
compaction was [1] this article with a heading "RocksDB instead of LevelDB" 
which mentions that leveldb was replaced with rocksdb as the default db backend 
for filestore OSDs and was even backported to Jewel because of the performance 
improvements.

        I figured there must be a way to upgrade an OSD from leveldb to rocksdb 
without needing to fully backfill the entire OSD.  There is [2] this article, 
but you need an active service account with Red Hat to access it.  I eventually 
came across [3] this article about optimizing Ceph Object Storage, which 
mentions migrating to rocksdb as a resolution to OSDs flapping due to omap 
compaction.  It links to the Red Hat article, but also has [4] these steps 
outlined in it.  I tried to follow the steps, but the OSD I tested this on was 
unable to start with [5] this segfault.  And then trying to move the OSD back 
to the original LevelDB omap folder resulted in [6] this in the log.  I 
apologize that all of my logging is with log level 1.  If needed I can get some 
higher log levels.

        My Ceph version is 12.2.4.  Does anyone have any suggestions for how I 
can update my filestore backend from leveldb to rocksdb?  Or, if that's the 
wrong direction, where I should be looking instead?  Thank you.


        [1] https://ceph.com/community/new-luminous-rados-improvements/
        [2] https://access.redhat.com/solutions/3210951
        [3] 
https://hubb.blob.core.windows.net/c2511cea-81c5-4386-8731-cc444ff806df-public/resources/Optimize
 Ceph object storage for production in multisite clouds.pdf

        [4] ■ Stop the OSD
        ■ mv /var/lib/ceph/osd/ceph-/current/omap 
/var/lib/ceph/osd/ceph-/omap.orig
        ■ ulimit -n 65535
        ■ ceph-kvstore-tool leveldb /var/lib/ceph/osd/ceph-/omap.orig 
store-copy /var/lib/ceph/osd/ceph-/current/omap 10000 rocksdb
        ■ ceph-osdomap-tool --omap-path /var/lib/ceph/osd/ceph-/current/omap 
--command check
        ■ sed -i s/leveldb/rocksdb/g /var/lib/ceph/osd/ceph-/superblock
        ■ chown ceph.ceph /var/lib/ceph/osd/ceph-/current/omap -R
        ■ cd /var/lib/ceph/osd/ceph-; rm -rf omap.orig
        ■ Start the OSD
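
        (The same steps as a small script, purely as a sketch: the OSD id below 
is hypothetical and stands in for the elided part of the paths, and the OSD 
must already be stopped.)

            OSD=0                                  # hypothetical OSD id; substitute your own
            DIR=/var/lib/ceph/osd/ceph-$OSD
            mv $DIR/current/omap $DIR/omap.orig
            ulimit -n 65535
            ceph-kvstore-tool leveldb $DIR/omap.orig store-copy $DIR/current/omap 10000 rocksdb
            ceph-osdomap-tool --omap-path $DIR/current/omap --command check
            sed -i s/leveldb/rocksdb/g $DIR/superblock
            chown -R ceph.ceph $DIR/current/omap
            cd $DIR && rm -rf omap.orig
            # then start the OSD again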

        [5] 2018-09-17 19:23:10.826227 7f1f3f2ab700 -1 abort: Corruption: 
Snappy not supported or corrupted Snappy compressed block contents
        2018-09-17 19:23:10.830525 7f1f3f2ab700 -1 *** Caught signal (Aborted) 
**

        [6] 2018-09-17 19:27:34.010125 7fcdee97cd80 -1 osd.0 0 OSD:init: unable 
to mount object store
        2018-09-17 19:27:34.010131 7fcdee97cd80 -1 ESC[0;31m ** ERROR: osd init 
failed: (1) Operation not permittedESC[0m
        2018-09-17 19:27:54.225941 7f7f03308d80  0 set uid:gid to 167:167 
(ceph:ceph)
        2018-09-17 19:27:54.225975 7f7f03308d80  0 ceph version 12.2.4 
(52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable), process 
(unknown), pid 361535
        2018-09-17 19:27:54.231275 7f7f03308d80  0 pidfile_write: ignore empty 
--pid-file
        2018-09-17 19:27:54.260207 7f7f03308d80  0 load: jerasure load: lrc 
load: isa
        2018-09-17 19:27:54.260520 7f7f03308d80  0 
filestore(/var/lib/ceph/osd/ceph-0) backend xfs (magic 0x58465342)
        2018-09-17 19:27:54.261135 7f7f03308d80  0 
filestore(/var/lib/ceph/osd/ceph-0) backend xfs (magic 0x58465342)
        2018-09-17 19:27:54.261750 7f7f03308d80  0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: FIEMAP ioctl 
is disabled via 'filestore fiemap' config option
        2018-09-17 19:27:54.261757 7f7f03308d80  0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: 
SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
        2018-09-17 19:27:54.261758 7f7f03308d80  0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: splice() is 
disabled via 'filestore splice' config option
        2018-09-17 19:27:54.286454 7f7f03308d80  0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: syncfs(2) 
syscall fully supported (by glibc and kernel)
        2018-09-17 19:27:54.286572 7f7f03308d80  0 
xfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: extsize is 
disabled by conf
        2018-09-17 19:27:54.287119 7f7f03308d80  0 
filestore(/var/lib/ceph/osd/ceph-0) start omap initiation
        2018-09-17 19:27:54.287527 7f7f03308d80 -1 
filestore(/var/lib/ceph/osd/ceph-0) mount(1723): Error initializing leveldb : 
Corruption: VersionEdit: unknown tag



_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
