Or is it possible to mount one OSD directly for read access to the files?

v

On Sun, Nov 11, 2018 at 1:47 PM Vlad Kopylov <[email protected]> wrote:

> Maybe it is possible if done via an NFS gateway export?
> Do the gateway settings allow selecting which OSD to read from?
>
> v
>
> On Sun, Nov 11, 2018 at 1:01 AM Martin Verges <[email protected]>
> wrote:
>
>> Hello Vlad,
>>
>> If you want to read from the same data, then it is not possible (as far
>> as I know).
>>
>> --
>> Martin Verges
>> Managing director
>>
>> Mobile: +49 174 9335695
>> E-Mail: [email protected]
>> Chat: https://t.me/MartinVerges
>>
>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>> CEO: Martin Verges - VAT-ID: DE310638492
>> Com. register: Amtsgericht Munich HRB 231263
>>
>> Web: https://croit.io
>> YouTube: https://goo.gl/PGE1Bx
>>
>> On Sat, Nov 10, 2018 at 3:47 AM Vlad Kopylov <[email protected]>
>> wrote:
>>
>>> Maybe I missed something, but the FS explicitly selects the pools to
>>> put files and metadata in, like I did below.
>>> So if I create new pools, the data in them will be different. And if I
>>> apply the dc1_primary rule to the cfs_data pool, and a client from dc3
>>> connects to fs t01, it will start reading from dc1 hosts.
>>>
>>>
>>> ceph osd pool create cfs_data 100
>>> ceph osd pool create cfs_meta 100
>>> ceph fs new t01 cfs_data cfs_meta
>>> sudo mount -t ceph ceph1:6789:/ /mnt/t01 -o
>>> name=admin,secretfile=/home/mciadmin/admin.secret
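>>>
>>> (One possible sketch, assuming a hypothetical per-datacenter pool like
>>> cfs_data_dc1 carrying the matching crush rule: CephFS can hold extra
>>> data pools, and a directory can be pinned to one via a file layout
>>> attribute, so each site mounts the same fs but reads its own pool.)
>>>
>>> ceph osd pool create cfs_data_dc1 100
>>> ceph osd pool set cfs_data_dc1 crush_rule dc1_primary
>>> ceph fs add_data_pool t01 cfs_data_dc1
>>> mkdir /mnt/t01/dc1
>>> setfattr -n ceph.dir.layout.pool -v cfs_data_dc1 /mnt/t01/dc1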
>>>
>>> rule dc1_primary {
>>>         id 1
>>>         type replicated
>>>         min_size 1
>>>         max_size 10
>>>         step take dc1
>>>         step chooseleaf firstn 1 type host
>>>         step emit
>>>         step take dc2
>>>         step chooseleaf firstn -2 type host
>>>         step emit
>>>         step take dc3
>>>         step chooseleaf firstn -2 type host
>>>         step emit
>>> }
>>>
>>> On Fri, Nov 9, 2018 at 9:32 PM Vlad Kopylov <[email protected]> wrote:
>>>
>>>> Just to confirm: it will still populate 3 copies, one in each
>>>> datacenter? I thought this map was to select where to write to; I
>>>> guess it does the write replication on the back end.
>>>>
>>>> I also thought pools were completely separate and clients would not
>>>> see each other's data?
>>>>
>>>> Thank you Martin!
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Nov 9, 2018 at 2:10 PM Martin Verges <[email protected]>
>>>> wrote:
>>>>
>>>>> Hello Vlad,
>>>>>
>>>>> you can generate something like this:
>>>>>
>>>>> rule dc1_primary_dc2_secondary {
>>>>>         id 1
>>>>>         type replicated
>>>>>         min_size 1
>>>>>         max_size 10
>>>>>         step take dc1
>>>>>         step chooseleaf firstn 1 type host
>>>>>         step emit
>>>>>         step take dc2
>>>>>         step chooseleaf firstn 1 type host
>>>>>         step emit
>>>>>         step take dc3
>>>>>         step chooseleaf firstn -2 type host
>>>>>         step emit
>>>>> }
>>>>>
>>>>> rule dc2_primary_dc1_secondary {
>>>>>         id 2
>>>>>         type replicated
>>>>>         min_size 1
>>>>>         max_size 10
>>>>>         step take dc2
>>>>>         step chooseleaf firstn 1 type host
>>>>>         step emit
>>>>>         step take dc1
>>>>>         step chooseleaf firstn 1 type host
>>>>>         step emit
>>>>>         step take dc3
>>>>>         step chooseleaf firstn -2 type host
>>>>>         step emit
>>>>> }
>>>>>
>>>>> After you have added such crush rules, you can configure the pools:
>>>>>
>>>>> ~ $ ceph osd pool set <pool_for_dc1> crush_rule dc1_primary_dc2_secondary
>>>>> ~ $ ceph osd pool set <pool_for_dc2> crush_rule dc2_primary_dc1_secondary
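>>>>>
>>>>> To check where a pool's objects land, something like this should
>>>>> work (pool and object names are placeholders):
>>>>>
>>>>> ~ $ ceph osd pool get <pool_for_dc1> crush_rule
>>>>> ~ $ ceph osd map <pool_for_dc1> some_object   # prints up/acting set and primary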
>>>>>
>>>>> Now you place your workload from dc1 in the dc1 pool, and the
>>>>> workload from dc2 in the dc2 pool. You could also use HDDs with SSD
>>>>> journals (if your workload isn't that write intensive) and save some
>>>>> money in dc3, as your clients would always read from an SSD and write
>>>>> to the hybrid OSDs.
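>>>>>
>>>>> A sketch of such a hybrid rule, assuming your OSDs carry hdd/ssd
>>>>> device classes (the class filter on "step take" is available since
>>>>> Luminous):
>>>>>
>>>>> rule dc1_primary_hybrid {
>>>>>         id 3
>>>>>         type replicated
>>>>>         min_size 1
>>>>>         max_size 10
>>>>>         # primary copy on an SSD host in dc1
>>>>>         step take dc1 class ssd
>>>>>         step chooseleaf firstn 1 type host
>>>>>         step emit
>>>>>         # second copy on an SSD host in dc2
>>>>>         step take dc2 class ssd
>>>>>         step chooseleaf firstn 1 type host
>>>>>         step emit
>>>>>         # remaining copies on cheaper HDD hosts in dc3
>>>>>         step take dc3 class hdd
>>>>>         step chooseleaf firstn -2 type host
>>>>>         step emit
>>>>> }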
>>>>>
>>>>> Btw. all of this can be done with a few simple clicks through our web
>>>>> frontend. Even if you want to export it via CephFS / NFS / .., it is
>>>>> possible to set it on a per-folder level. Feel free to take a look at
>>>>> https://www.youtube.com/watch?v=V33f7ipw9d4 to see how easy it can
>>>>> be.
>>>>>
>>>>> --
>>>>> Martin Verges
>>>>> Managing director
>>>>>
>>>>> Mobile: +49 174 9335695
>>>>> E-Mail: [email protected]
>>>>> Chat: https://t.me/MartinVerges
>>>>>
>>>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>>>> CEO: Martin Verges - VAT-ID: DE310638492
>>>>> Com. register: Amtsgericht Munich HRB 231263
>>>>>
>>>>> Web: https://croit.io
>>>>> YouTube: https://goo.gl/PGE1Bx
>>>>>
>>>>>
>>>>> 2018-11-09 17:35 GMT+01:00 Vlad Kopylov <[email protected]>:
>>>>> > Please disregard the pg status; one of the test VMs was down for
>>>>> > some time and it is healing.
>>>>> > The only question is how to make it read from the proper
>>>>> > datacenter, if you have an example.
>>>>> >
>>>>> > Thanks
>>>>> >
>>>>> >
>>>>> > On Fri, Nov 9, 2018 at 11:28 AM Vlad Kopylov <[email protected]>
>>>>> wrote:
>>>>> >>
>>>>> >> Martin, thank you for the tip.
>>>>> >> Googling ceph crush rule examples doesn't give much on rules, just
>>>>> >> static placement of buckets.
>>>>> >> All of this seems to be about placing data, not about giving a
>>>>> >> client in a specific datacenter the proper OSD to read from.
>>>>> >>
>>>>> >> Maybe something is wrong with the placement groups?
>>>>> >>
>>>>> >> I added datacenters dc1, dc2 and dc3.
>>>>> >> The current replicated_rule is:
>>>>> >>
>>>>> >> rule replicated_rule {
>>>>> >>         id 0
>>>>> >>         type replicated
>>>>> >>         min_size 1
>>>>> >>         max_size 10
>>>>> >>         step take default
>>>>> >>         step chooseleaf firstn 0 type host
>>>>> >>         step emit
>>>>> >> }
>>>>> >>
>>>>> >> # buckets
>>>>> >> host ceph1 {
>>>>> >>         id -3            # do not change unnecessarily
>>>>> >>         id -2 class ssd  # do not change unnecessarily
>>>>> >>         # weight 1.000
>>>>> >>         alg straw2
>>>>> >>         hash 0  # rjenkins1
>>>>> >>         item osd.0 weight 1.000
>>>>> >> }
>>>>> >> datacenter dc1 {
>>>>> >>         id -9            # do not change unnecessarily
>>>>> >>         id -4 class ssd  # do not change unnecessarily
>>>>> >>         # weight 1.000
>>>>> >>         alg straw2
>>>>> >>         hash 0  # rjenkins1
>>>>> >>         item ceph1 weight 1.000
>>>>> >> }
>>>>> >> host ceph2 {
>>>>> >>         id -5            # do not change unnecessarily
>>>>> >>         id -6 class ssd  # do not change unnecessarily
>>>>> >>         # weight 1.000
>>>>> >>         alg straw2
>>>>> >>         hash 0  # rjenkins1
>>>>> >>         item osd.1 weight 1.000
>>>>> >> }
>>>>> >> datacenter dc2 {
>>>>> >>         id -10           # do not change unnecessarily
>>>>> >>         id -8 class ssd  # do not change unnecessarily
>>>>> >>         # weight 1.000
>>>>> >>         alg straw2
>>>>> >>         hash 0  # rjenkins1
>>>>> >>         item ceph2 weight 1.000
>>>>> >> }
>>>>> >> host ceph3 {
>>>>> >>         id -7            # do not change unnecessarily
>>>>> >>         id -12 class ssd # do not change unnecessarily
>>>>> >>         # weight 1.000
>>>>> >>         alg straw2
>>>>> >>         hash 0  # rjenkins1
>>>>> >>         item osd.2 weight 1.000
>>>>> >> }
>>>>> >> datacenter dc3 {
>>>>> >>         id -11           # do not change unnecessarily
>>>>> >>         id -13 class ssd # do not change unnecessarily
>>>>> >>         # weight 1.000
>>>>> >>         alg straw2
>>>>> >>         hash 0  # rjenkins1
>>>>> >>         item ceph3 weight 1.000
>>>>> >> }
>>>>> >> root default {
>>>>> >>         id -1            # do not change unnecessarily
>>>>> >>         id -14 class ssd # do not change unnecessarily
>>>>> >>         # weight 3.000
>>>>> >>         alg straw2
>>>>> >>         hash 0  # rjenkins1
>>>>> >>         item dc1 weight 1.000
>>>>> >>         item dc2 weight 1.000
>>>>> >>         item dc3 weight 1.000
>>>>> >> }
>>>>> >>
>>>>> >>
>>>>> >> #ceph pg dump
>>>>> >> dumped all
>>>>> >> version 29433
>>>>> >> stamp 2018-11-09 11:23:44.510872
>>>>> >> last_osdmap_epoch 0
>>>>> >> last_pg_scan 0
>>>>> >> PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP SNAPTRIMQ_LEN
>>>>> >> 1.5f 0 0 0 0 0 0 0 0 active+clean 2018-11-09 04:35:32.320607 0'0 544:1317 [0,2,1] 0 [0,2,1] 0 0'0 2018-11-09 04:35:32.320561 0'0 2018-11-04 11:55:54.756115 0
>>>>> >> 2.5c 143 0 143 0 0 19490267 461 461 active+undersized+degraded 2018-11-08 19:02:03.873218 508'461 544:2100 [2,1] 2 [2,1] 2 290'380 2018-11-07 18:58:43.043719 64'120 2018-11-05 14:21:49.256324 0
>>>>> >> .....
>>>>> >> sum 15239 0 2053 2659 0 2157615019 58286 58286
>>>>> >> OSD_STAT USED    AVAIL  TOTAL  HB_PEERS PG_SUM PRIMARY_PG_SUM
>>>>> >> 2        3.7 GiB 28 GiB 32 GiB    [0,1]    200             73
>>>>> >> 1        3.7 GiB 28 GiB 32 GiB    [0,2]    200             58
>>>>> >> 0        3.7 GiB 28 GiB 32 GiB    [1,2]    173             69
>>>>> >> sum       11 GiB 85 GiB 96 GiB
>>>>> >>
>>>>> >> #ceph pg map 2.5c
>>>>> >> osdmap e545 pg 2.5c (2.5c) -> up [2,1] acting [2,1]
>>>>> >>
>>>>> >> #ceph pg map 1.5f
>>>>> >> osdmap e547 pg 1.5f (1.5f) -> up [0,2,1] acting [0,2,1]
>>>>> >>
>>>>> >>
>>>>> >> On Fri, Nov 9, 2018 at 2:21 AM Martin Verges <
>>>>> [email protected]>
>>>>> >> wrote:
>>>>> >>>
>>>>> >>> Hello Vlad,
>>>>> >>>
>>>>> >>> Ceph clients connect to the primary OSD of each PG. If you create
>>>>> >>> a crush rule for building1 and one for building2, each taking an
>>>>> >>> OSD from its own building as the first one, reads from the pool
>>>>> >>> will always stay in the same building (if the cluster is healthy),
>>>>> >>> and only write requests get replicated to the other building.
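>>>>> >>>
>>>>> >>> A minimal sketch of such a rule (bucket names are hypothetical;
>>>>> >>> the first OSD emitted becomes the primary, so a pool using this
>>>>> >>> rule reads from building1):
>>>>> >>>
>>>>> >>> rule building1_primary {
>>>>> >>>         id 1
>>>>> >>>         type replicated
>>>>> >>>         min_size 1
>>>>> >>>         max_size 10
>>>>> >>>         step take building1
>>>>> >>>         step chooseleaf firstn 1 type host
>>>>> >>>         step emit
>>>>> >>>         step take building2
>>>>> >>>         step chooseleaf firstn -1 type host
>>>>> >>>         step emit
>>>>> >>> }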
>>>>> >>>
>>>>> >>> --
>>>>> >>> Martin Verges
>>>>> >>> Managing director
>>>>> >>>
>>>>> >>> Mobile: +49 174 9335695
>>>>> >>> E-Mail: [email protected]
>>>>> >>> Chat: https://t.me/MartinVerges
>>>>> >>>
>>>>> >>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>>>> >>> CEO: Martin Verges - VAT-ID: DE310638492
>>>>> >>> Com. register: Amtsgericht Munich HRB 231263
>>>>> >>>
>>>>> >>> Web: https://croit.io
>>>>> >>> YouTube: https://goo.gl/PGE1Bx
>>>>> >>>
>>>>> >>>
>>>>> >>> 2018-11-09 4:54 GMT+01:00 Vlad Kopylov <[email protected]>:
>>>>> >>> > I am trying to test replicated ceph with servers in different
>>>>> >>> > buildings, and I have a read problem.
>>>>> >>> > Reads from one building go to an OSD in another building and
>>>>> >>> > vice versa, making reads slower than writes! Reads are as slow
>>>>> >>> > as the slowest node.
>>>>> >>> >
>>>>> >>> > Is there a way to
>>>>> >>> > - disable parallel reads (so it reads only from the same osd
>>>>> >>> > node where the mon is);
>>>>> >>> > - or give each client a read restriction per osd?
>>>>> >>> > - or maybe strictly specify the read osd on mount;
>>>>> >>> > - or have a node read delay cap (for example, if a node's
>>>>> >>> > timeout is larger than 2 ms, do not use that node for reads, as
>>>>> >>> > other replicas are available);
>>>>> >>> > - or the ability to place Clients on the Crush map, so it
>>>>> >>> > understands that an osd in, for example, the same data-center as
>>>>> >>> > the client has preference, and pulls data from it/them.
>>>>> >>> >
>>>>> >>> > Mounting with the kernel client, latest Mimic.
>>>>> >>> >
>>>>> >>> > Thank you!
>>>>> >>> >
>>>>> >>> > Vlad
>>>>> >>> >
>>>>> >>> > _______________________________________________
>>>>> >>> > ceph-users mailing list
>>>>> >>> > [email protected]
>>>>> >>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>> >>> >
>>>>>
>>>>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
