On 26.02.2018 at 20:31, Gregory Farnum wrote:
> On Mon, Feb 26, 2018 at 11:26 AM Oliver Freyermuth
> <[email protected]> wrote:
>
> > On 26.02.2018 at 20:09, Oliver Freyermuth wrote:
> > On 26.02.2018 at 19:56, Gregory Farnum wrote:
> >>
> >> On Mon, Feb 26, 2018 at 8:25 AM Oliver Freyermuth
> >> <[email protected]> wrote:
> >>
> >> On 26.02.2018 at 16:59, Patrick Donnelly wrote:
> >> > On Sun, Feb 25, 2018 at 10:26 AM, Oliver Freyermuth
> >> > <[email protected]> wrote:
> >> >> Looking with:
> >> >> ceph daemon osd.2 perf dump
> >> >> I get:
> >> >>     "bluefs": {
> >> >>         "gift_bytes": 0,
> >> >>         "reclaim_bytes": 0,
> >> >>         "db_total_bytes": 84760592384,
> >> >>         "db_used_bytes": 78920024064,
> >> >>         "wal_total_bytes": 0,
> >> >>         "wal_used_bytes": 0,
> >> >>         "slow_total_bytes": 0,
> >> >>         "slow_used_bytes": 0,
> >> >> so it seems this is almost exclusively RocksDB usage.
> >> >>
> >> >> Is this expected?
> >> >
> >> > Yes. The directory entries are stored in the omap of the objects. This
> >> > will be stored in the RocksDB backend of Bluestore.
> >> >
> >> >> Is there a recommendation on how much MDS storage is needed for a CephFS with 450 TB?
> >> >
> >> > It seems in the above test you're using about 1 kB per inode (file).
> >> > Using that you can extrapolate how much space the metadata pool needs
> >> > based on your file system usage. (If all you're doing is filling the
> >> > file system with empty files, of course you're going to need an
> >> > unusually large metadata pool.)
> >>
> >> Many thanks, this helps!
> >> We naturally hope our users will not do this; this stress test was a worst case -
> >> but the rough number (1 kB per inode) does indeed help a lot, and also the increase with
> >> modifications of the file as laid out by David.
> >>
> >> Is the slow backfilling also normal?
> >> Will such an increase in storage (by many file modifications) at some point also be reduced, i.e.
> >> is the database compacted / can one trigger that / is there something like "SQL vacuum"?
> >>
> >> To also answer David's questions in parallel:
> >> - Concerning the slow backfill, I am only talking about the "metadata OSDs".
> >>   They are fully SSD-backed and have no separate device for block.db / WAL.
> >> - I adjusted backfills up to 128 for those metadata OSDs; the cluster is currently fully empty, i.e. no clients are doing anything.
> >>   There are no slow requests.
> >>   Since no clients are doing anything and the rest of the cluster is now clean (apart from the two backfilling OSDs),
> >>   right now there is also no memory pressure at all.
> >>   The "clean" OSDs are reading with 7 MB/s each, with 5 % CPU load each.
> >>   The OSDs being backfilled have 3.3 % CPU load and about 250 kB/s of write throughput.
> >>   Network traffic between the node with the clean OSDs and the "being-backfilled" OSDs is about 1.5 Mbit/s,
> >>   while there is significantly more bandwidth available...
> >> - Checking sleeps with:
> >>   # ceph -n osd.1 --show-config | grep sleep
> >>   osd_recovery_sleep = 0.000000
> >>   osd_recovery_sleep_hdd = 0.100000
> >>   osd_recovery_sleep_hybrid = 0.025000
> >>   osd_recovery_sleep_ssd = 0.000000
> >>   shows there should be 0 sleep. Or is there another way to query?
> >>
> >>
> >> Check if the OSDs are reporting their stores or their journals to be
> >> "rotational" via "ceph osd metadata"?
> >
> > I find:
> >     "bluestore_bdev_model": "Micron_5100_MTFD",
> >     "bluestore_bdev_partition_path": "/dev/sda2",
> >     "bluestore_bdev_rotational": "0",
> >     "bluestore_bdev_size": "239951482880",
> >     "bluestore_bdev_type": "ssd",
> >     [...]
> >     "rotational": "0"
> >
> > for all of them (obviously with different device paths).
> > Also, they've been assigned the ssd device class automatically:
> > # ceph osd df | head
> > ID CLASS WEIGHT  REWEIGHT SIZE USE    AVAIL %USE  VAR  PGS
> >  0   ssd 0.21829  1.00000 223G 11310M  212G  4.94 0.94   0
> >  1   ssd 0.21829  1.00000 223G 11368M  212G  4.97 0.95   0
> >  2   ssd 0.21819  1.00000 223G 76076M  149G 33.25 6.35 128
> >  3   ssd 0.21819  1.00000 223G 76268M  148G 33.33 6.37 128
> >
> > So this should not be the reason...
>
> Checking again with the nice "grep" expression from the other thread
> concerning bluestore backfilling...
> # ceph osd metadata | grep 'id\|rotational'
> yields:
> "id": 0,
> "bluefs_db_rotational": "0",
> "bluestore_bdev_rotational": "0",
> "journal_rotational": "1",
> "rotational": "0"
> "id": 1,
> "bluefs_db_rotational": "0",
> "bluestore_bdev_rotational": "0",
> "journal_rotational": "1",
> "rotational": "0"
> "id": 2,
> "bluefs_db_rotational": "0",
> "bluestore_bdev_rotational": "0",
> "journal_rotational": "1",
> "rotational": "0"
> "id": 3,
> "bluefs_db_rotational": "0",
> "bluestore_bdev_rotational": "0",
> "journal_rotational": "1",
> "rotational": "0"
> "id": 4,
> "bluefs_db_rotational": "0",
> "bluefs_slow_rotational": "1",
> "bluestore_bdev_rotational": "1",
> "journal_rotational": "1",
> "rotational": "1"
> 0-3 are pure SSDs; there is no separate block.db device.
> Is "journal_rotational" really relevant for bluestore, though?
> If so, detection seems broken...
>
>
> For BlueStore it's using that config value to convey data about the WAL and
> db. As with that thread, check if your OS is lying (they often do) about the
> relevant block devices; at a quick skim the bluestore detection code looks
> correct to me.
> -Greg
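Patrick's rough figure of about 1 kB of metadata per inode, quoted above, translates into a quick sizing estimate for the metadata pool. A minimal sketch, assuming that per-inode figure and 3x replication (both assumptions from this thread, not fixed constants; real usage grows further with file modifications, as David pointed out):

```python
def metadata_pool_estimate(num_inodes, bytes_per_inode=1024, replication=3):
    """Rough raw capacity needed for the CephFS metadata pool.

    bytes_per_inode=1024 is the ~1 kB/inode observed in this thread;
    omap churn from file modifications will push this higher.
    """
    return num_inodes * bytes_per_inode * replication

# e.g. 100 million files at 3x replication:
est = metadata_pool_estimate(100_000_000)
print(f"{est / 1024**3:.0f} GiB raw")  # ~286 GiB raw
```

So even a few hundred million (non-empty) files stay well within a handful of SSD-backed OSDs; the stress test with empty files is the pathological case.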
That's strange, Greg: the OS does not seem to be lying here, it looks fine:
[root@mon002 ~]# cat /sys/block/sda/queue/rotational
0
[root@mon002 ~]# cat /sys/block/sdb/queue/rotational
0
Even stranger, as I said: "0-3 are pure SSDs, there is no separate block.db device."
So for osd.0 the detection reports "bluestore_bdev_rotational": "0", i.e. non-rotational,
while "journal_rotational": "1" claims the journal is rotational, even though all data
and the block.db / WAL live on the very same block device.
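If Greg is right that the journal flag feeds into recovery throttling, this mismatch would matter because the OSD derives its effective recovery sleep from the rotational flags. A minimal sketch of that selection, assuming (this is an assumption about the logic, not the actual Ceph code) that the hybrid value applies when the store is non-rotational but the journal is reported rotational; the defaults match the values from `ceph -n osd.1 --show-config | grep sleep` above:

```python
def pick_recovery_sleep(store_rotational, journal_rotational,
                        sleep_hdd=0.1, sleep_hybrid=0.025, sleep_ssd=0.0):
    """Sketch: derive the effective recovery sleep from rotational flags.
    Assumed selection logic, not the actual Ceph implementation."""
    if not store_rotational and not journal_rotational:
        return sleep_ssd      # pure SSD: no sleep
    if not store_rotational and journal_rotational:
        return sleep_hybrid   # SSD store, journal reported rotational
    return sleep_hdd          # rotational store

# osd.0 as reported: bluestore_bdev_rotational=0, journal_rotational=1
print(pick_recovery_sleep(False, True))  # 0.025 instead of the expected 0.0
```

That would quietly throttle backfill on these pure-SSD OSDs despite osd_recovery_sleep_ssd being 0.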
Cheers,
Oliver
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
