On 26.02.2018 at 20:09, Oliver Freyermuth wrote:
> On 26.02.2018 at 19:56, Gregory Farnum wrote:
>>
>> On Mon, Feb 26, 2018 at 8:25 AM Oliver Freyermuth
>> <[email protected]> wrote:
>>
>> On 26.02.2018 at 16:59, Patrick Donnelly wrote:
>> > On Sun, Feb 25, 2018 at 10:26 AM, Oliver Freyermuth
>> > <[email protected]> wrote:
>> >> Looking with:
>> >>   ceph daemon osd.2 perf dump
>> >> I get:
>> >>     "bluefs": {
>> >>         "gift_bytes": 0,
>> >>         "reclaim_bytes": 0,
>> >>         "db_total_bytes": 84760592384,
>> >>         "db_used_bytes": 78920024064,
>> >>         "wal_total_bytes": 0,
>> >>         "wal_used_bytes": 0,
>> >>         "slow_total_bytes": 0,
>> >>         "slow_used_bytes": 0,
>> >> so it seems this is almost exclusively RocksDB usage.
>> >>
>> >> Is this expected?
>> >
>> > Yes. The directory entries are stored in the omap of the objects. This
>> > will be stored in the RocksDB backend of Bluestore.
>> >
>> >> Is there a recommendation on how much MDS storage is needed for a
>> >> CephFS with 450 TB?
>> >
>> > It seems in the above test you're using about 1 kB per inode (file).
>> > Using that you can extrapolate how much space the metadata pool needs
>> > based on your file system usage. (If all you're doing is filling the
>> > file system with empty files, of course you're going to need an
>> > unusually large metadata pool.)
>> >
>> Many thanks, this helps!
>> We naturally hope our users will not do this; this stress test was a
>> worst case. But the rough number (1 kB per inode) does indeed help a lot,
>> and so does the increase with modifications of the file as laid out by
>> David.
>>
>> Is the slow backfilling also normal?
>> Will such an increase in storage (caused by many file modifications) be
>> reduced again at some point, i.e. is the database compacted / can one
>> trigger that / is there something like "SQL vacuum"?
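For what it's worth, the extrapolation from the ~1 kB/inode figure above can be sketched as follows. The file count and replication factor are hypothetical examples, not values from this cluster:

```python
# Rough metadata-pool sizing sketch based on the ~1 kB per inode
# figure quoted above. File count and replica count below are
# hypothetical examples (assumptions), not measured values.
BYTES_PER_INODE = 1024   # ~1 kB of RocksDB/omap per file (from the thread)
REPLICATION = 3          # assumed metadata pool replica count

def metadata_pool_bytes(num_files: int) -> int:
    """Estimated raw metadata-pool usage for num_files inodes."""
    return num_files * BYTES_PER_INODE * REPLICATION

# e.g. 100 million files -> ~286 GiB raw across the metadata pool
estimate = metadata_pool_bytes(100_000_000)
print(f"{estimate / 1024**3:.1f} GiB")  # prints "286.1 GiB"
```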
>>
>> To also answer David's questions in parallel:
>> - Concerning the slow backfill, I am only talking about the "metadata
>>   OSDs". They are fully SSD-backed and have no separate device for
>>   block.db / WAL.
>> - I adjusted backfills up to 128 for those metadata OSDs; the cluster is
>>   currently fully empty, i.e. no clients are doing anything.
>>   There are no slow requests.
>>   Since no clients are doing anything and the rest of the cluster is now
>>   clean (apart from the two backfilling OSDs), right now there is also
>>   no memory pressure at all.
>>   The "clean" OSDs are reading with 7 MB/s each, at 5 % CPU load each.
>>   The OSDs being backfilled have 3.3 % CPU load and about 250 kB/s of
>>   write throughput.
>>   Network traffic between the node with the clean OSDs and the
>>   "being-backfilled" OSDs is about 1.5 Mbit/s, while there is
>>   significantly more bandwidth available...
>> - Checking sleeps with:
>>     # ceph -n osd.1 --show-config | grep sleep
>>     osd_recovery_sleep = 0.000000
>>     osd_recovery_sleep_hdd = 0.100000
>>     osd_recovery_sleep_hybrid = 0.025000
>>     osd_recovery_sleep_ssd = 0.000000
>>   shows there should be 0 sleep. Or is there another way to query?
>>
>> Check if the OSDs are reporting their stores or their journals to be
>> "rotational" via "ceph osd metadata"?
>
> I find:
>     "bluestore_bdev_model": "Micron_5100_MTFD",
>     "bluestore_bdev_partition_path": "/dev/sda2",
>     "bluestore_bdev_rotational": "0",
>     "bluestore_bdev_size": "239951482880",
>     "bluestore_bdev_type": "ssd",
>     [...]
>     "rotational": "0"
>
> for all of them (obviously with different device paths).
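If detection matters here, the mechanism would be roughly the following. This is a sketch approximating the Luminous-era sleep selection (the authoritative logic lives in the OSD source, src/osd/OSD.cc); the values match the defaults shown above:

```python
# Sketch of how the OSD picks its per-op recovery sleep, approximating
# the Luminous-era selection logic (not the literal OSD code).
# Defaults mirror the osd_recovery_sleep_* values shown above.
def recovery_sleep(store_rotational: bool, journal_rotational: bool,
                   sleep_hdd: float = 0.1,
                   sleep_hybrid: float = 0.025,
                   sleep_ssd: float = 0.0) -> float:
    """Return the recovery sleep the OSD would apply, given the
    rotational flags reported in `ceph osd metadata`."""
    if not store_rotational and not journal_rotational:
        return sleep_ssd        # all-flash: no throttling
    if store_rotational and not journal_rotational:
        return sleep_hybrid     # HDD store, SSD journal/db
    return sleep_hdd            # anything with a rotational journal

# An SSD-backed OSD whose journal is misdetected as rotational falls
# into the HDD branch and gets throttled with the full 0.1 s sleep:
print(recovery_sleep(store_rotational=False, journal_rotational=True))  # 0.1
```

So if `journal_rotational` is being reported as "1" for the pure-SSD OSDs, they would recover at HDD pace despite `osd_recovery_sleep_ssd = 0`.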
> Also, they've been assigned the ssd device class automatically:
>     # ceph osd df | head
>     ID CLASS WEIGHT  REWEIGHT SIZE USE    AVAIL %USE  VAR  PGS
>      0   ssd 0.21829  1.00000 223G 11310M  212G  4.94 0.94   0
>      1   ssd 0.21829  1.00000 223G 11368M  212G  4.97 0.95   0
>      2   ssd 0.21819  1.00000 223G 76076M  149G 33.25 6.35 128
>      3   ssd 0.21819  1.00000 223G 76268M  148G 33.33 6.37 128
>
> So this should not be the reason...
>
Checking again with the nice "grep" expression from the other thread concerning
bluestore backfilling...
# ceph osd metadata | grep 'id\|rotational'
yields:
"id": 0,
"bluefs_db_rotational": "0",
"bluestore_bdev_rotational": "0",
"journal_rotational": "1",
"rotational": "0"
"id": 1,
"bluefs_db_rotational": "0",
"bluestore_bdev_rotational": "0",
"journal_rotational": "1",
"rotational": "0"
"id": 2,
"bluefs_db_rotational": "0",
"bluestore_bdev_rotational": "0",
"journal_rotational": "1",
"rotational": "0"
"id": 3,
"bluefs_db_rotational": "0",
"bluestore_bdev_rotational": "0",
"journal_rotational": "1",
"rotational": "0"
"id": 4,
"bluefs_db_rotational": "0",
"bluefs_slow_rotational": "1",
"bluestore_bdev_rotational": "1",
"journal_rotational": "1",
"rotational": "1"
0-3 are pure SSDs, there is no separate block.db device.
Is "journal_rotational" really relevant for bluestore, though?
If so, detection seems broken...
For comparison, osd.4 is an HDD with its block.db on an SSD.
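The same check can be scripted against the JSON form of the metadata. A sketch, assuming `ceph osd metadata -f json` output is fed in; the sample below is a shortened, hypothetical stand-in for the real dump above:

```python
import json

# Sample data: a shortened, hypothetical stand-in for the output of
# `ceph osd metadata -f json` (real dumps carry many more keys).
sample = json.loads("""[
  {"id": 0, "bluestore_bdev_rotational": "0", "journal_rotational": "1"},
  {"id": 4, "bluestore_bdev_rotational": "1", "journal_rotational": "1"}
]""")

def suspicious_osds(metadata):
    """OSDs whose store is non-rotational but whose journal is
    reported rotational -- the pattern seen for osd.0-3 above."""
    return [m["id"] for m in metadata
            if m.get("bluestore_bdev_rotational") == "0"
            and m.get("journal_rotational") == "1"]

print(suspicious_osds(sample))  # [0]
```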
Cheers,
Oliver
>>
>> If that's being detected wrong, that would cause them to be using those
>> sleeps.
>> -Greg
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
