On 26.02.2018 at 20:31, Gregory Farnum wrote:
> On Mon, Feb 26, 2018 at 11:26 AM Oliver Freyermuth
> <[email protected]> wrote:
> 
>     On 26.02.2018 at 20:09, Oliver Freyermuth wrote:
>     > On 26.02.2018 at 19:56, Gregory Farnum wrote:
>     >>
>     >>
>     >> On Mon, Feb 26, 2018 at 8:25 AM Oliver Freyermuth
>     >> <[email protected]> wrote:
>     >>
>     >>     On 26.02.2018 at 16:59, Patrick Donnelly wrote:
>     >>     > On Sun, Feb 25, 2018 at 10:26 AM, Oliver Freyermuth
>     >>     > <[email protected]> wrote:
>     >>     >> Looking with:
>     >>     >> ceph daemon osd.2 perf dump
>     >>     >> I get:
>     >>     >>     "bluefs": {
>     >>     >>         "gift_bytes": 0,
>     >>     >>         "reclaim_bytes": 0,
>     >>     >>         "db_total_bytes": 84760592384,
>     >>     >>         "db_used_bytes": 78920024064,
>     >>     >>         "wal_total_bytes": 0,
>     >>     >>         "wal_used_bytes": 0,
>     >>     >>         "slow_total_bytes": 0,
>     >>     >>         "slow_used_bytes": 0,
>     >>     >> so it seems this is almost exclusively RocksDB usage.
>     >>     >>
>     >>     >> Is this expected?
>     >>     >
>     >>     > Yes. The directory entries are stored in the omap of the
>     >>     > objects. This will be stored in the RocksDB backend of BlueStore.
>     >>     >
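
As a quick sanity check of the perf dump numbers quoted above, the RocksDB usage fraction can be computed directly (a minimal sketch; the values are copied from the dump output):

```shell
# Values copied from the "ceph daemon osd.2 perf dump" bluefs section above.
db_total=84760592384
db_used=78920024064
# Integer percentage of the DB space consumed by RocksDB.
echo "RocksDB usage: $(( 100 * db_used / db_total ))%"
# → RocksDB usage: 93%
```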
>     >>     >> Is there a recommendation on how much MDS storage is needed
>     >>     >> for a CephFS with 450 TB?
>     >>     >
>     >>     > It seems in the above test you're using about 1 kB per inode
>     >>     > (file). Using that you can extrapolate how much space the
>     >>     > metadata pool needs based on your file system usage. (If all
>     >>     > you're doing is filling the file system with empty files, of
>     >>     > course you're going to need an unusually large metadata pool.)
>     >>     >
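
To make that extrapolation concrete, a back-of-the-envelope estimate (the file count here is an assumed example, not a measurement; ~1 kB/inode is the rate observed in the test above):

```shell
# Assumed workload: 100 million files at ~1 kB of metadata per inode.
inodes=100000000
bytes_per_inode=1024
echo "estimated metadata pool size: $(( inodes * bytes_per_inode / 1024 / 1024 / 1024 )) GiB"
# → estimated metadata pool size: 95 GiB
```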
>     >>     Many thanks, this helps!
>     >>     We naturally hope our users will not do this; this stress test
>     >>     was a worst case. But the rough number (1 kB per inode) does
>     >>     indeed help a lot, as does the growth through file modifications
>     >>     laid out by David.
>     >>
>     >>     Is the slow backfilling also normal?
>     >>     Will the storage growth caused by many file modifications be
>     >>     reduced again at some point, i.e. is the database compacted
>     >>     automatically, can one trigger a compaction manually, and is
>     >>     there something like SQL's VACUUM?
>     >>
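
For what it's worth, BlueStore's RocksDB can be compacted manually per OSD via the admin socket (hedged: availability of the command depends on the Ceph release, and osd.2 is just the example id from above):

```shell
# Trigger a manual RocksDB compaction on one OSD via its admin socket.
# Guarded so the snippet is harmless on a host without the ceph CLI.
if command -v ceph >/dev/null 2>&1; then
    ceph daemon osd.2 compact
else
    echo "ceph CLI not available on this host"
fi
```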
>     >>     To also answer David's questions in parallel:
>     >>     - Concerning the slow backfill, I am only talking about the
>     >>       "metadata OSDs". They are fully SSD-backed and have no
>     >>       separate device for block.db / WAL.
>     >>     - I adjusted backfills up to 128 for those metadata OSDs; the
>     >>       cluster is currently fully idle, i.e. no clients are doing
>     >>       anything. There are no slow requests.
>     >>       Since no clients are doing anything and the rest of the
>     >>       cluster is now clean (apart from the two backfilling OSDs),
>     >>       there is also no memory pressure at all right now.
>     >>       The "clean" OSDs are each reading at 7 MB/s, with 5 % CPU
>     >>       load each. The OSDs being backfilled have 3.3 % CPU load and
>     >>       about 250 kB/s of write throughput.
>     >>       Network traffic between the node with the clean OSDs and the
>     >>       being-backfilled OSDs is about 1.5 Mbit/s, while significantly
>     >>       more bandwidth is available...
>     >>     - Checking sleeps with:
>     >>     # ceph -n osd.1 --show-config | grep sleep
>     >>     osd_recovery_sleep = 0.000000
>     >>     osd_recovery_sleep_hdd = 0.100000
>     >>     osd_recovery_sleep_hybrid = 0.025000
>     >>     osd_recovery_sleep_ssd = 0.000000
>     >>     shows there should be 0 sleep. Or is there another way to query?
>     >>
>     >>
>     >> Check if the OSDs are reporting their stores or their journals to
>     >> be "rotational" via "ceph osd metadata"?
>     >
>     > I find:
>     >         "bluestore_bdev_model": "Micron_5100_MTFD",
>     >         "bluestore_bdev_partition_path": "/dev/sda2",
>     >         "bluestore_bdev_rotational": "0",
>     >         "bluestore_bdev_size": "239951482880",
>     >         "bluestore_bdev_type": "ssd",
>     > [...]
>     >         "rotational": "0"
>     >
>     > for all of them (obviously with different device paths).
>     > Also, they've been assigned the ssd device class automatically:
>     > # ceph osd df | head
>     > ID  CLASS WEIGHT  REWEIGHT SIZE  USE    AVAIL %USE  VAR  PGS
>     >   0   ssd 0.21829  1.00000  223G 11310M  212G  4.94 0.94   0
>     >   1   ssd 0.21829  1.00000  223G 11368M  212G  4.97 0.95   0
>     >   2   ssd 0.21819  1.00000  223G 76076M  149G 33.25 6.35 128
>     >   3   ssd 0.21819  1.00000  223G 76268M  148G 33.33 6.37 128
>     >
>     > So this should not be the reason...
>     >
> 
>     Checking again with the nice "grep" expression from the other thread
>     concerning bluestore backfilling...
>     # ceph osd metadata | grep 'id\|rotational'
>     yields:
>             "id": 0,
>             "bluefs_db_rotational": "0",
>             "bluestore_bdev_rotational": "0",
>             "journal_rotational": "1",
>             "rotational": "0"
>             "id": 1,
>             "bluefs_db_rotational": "0",
>             "bluestore_bdev_rotational": "0",
>             "journal_rotational": "1",
>             "rotational": "0"
>             "id": 2,
>             "bluefs_db_rotational": "0",
>             "bluestore_bdev_rotational": "0",
>             "journal_rotational": "1",
>             "rotational": "0"
>             "id": 3,
>             "bluefs_db_rotational": "0",
>             "bluestore_bdev_rotational": "0",
>             "journal_rotational": "1",
>             "rotational": "0"
>             "id": 4,
>             "bluefs_db_rotational": "0",
>             "bluefs_slow_rotational": "1",
>             "bluestore_bdev_rotational": "1",
>             "journal_rotational": "1",
>             "rotational": "1"
>     0-3 are pure SSDs, there is no separate block.db device.
>     Is "journal_rotational" really relevant for bluestore, though?
>     If so, detection seems broken...
> 
> 
> For BlueStore it's using that config value to convey data about the WAL and 
> db. As with that thread, check if your OS is lying (they often do) about the 
> relevant block devices; at a quick skim the bluestore detection code looks 
> correct to me.
> -Greg

That's strange, the OS looks fine:
[root@mon002 ~]# cat /sys/block/sda/queue/rotational
0
[root@mon002 ~]# cat /sys/block/sdb/queue/rotational
0

Even stranger, as I said: OSDs 0-3 are pure SSDs, and there is no separate
block.db device.
So for osd.0, the detection reports "bluestore_bdev_rotational": "0" (not
rotational), while "journal_rotational": "1" claims it is rotational, even
though the data and the block.db / WAL all live on the very same block
device.

Cheers,
        Oliver


_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
