Hi Jakub,

On 05/02/2018 at 12:26, Jakub Jaszewski wrote:
Hi Frederic,

Many thanks for your contribution to the topic!

I've just set logging level 20 for filestore via

ceph tell osd.0 config set debug_filestore 20

but so far nothing matches the keyword 'split' in /var/log/ceph/ceph-osd.0.log

So, if you're running Ceph 12.2.1 or later, that means splitting is not happening. Did you check during writes? Did you check the other OSDs' logs?
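
To check every OSD log on a node at once rather than only osd.0, something along these lines might help. It is a minimal sketch: the function name count_split_lines is made up for illustration, it assumes the default /var/log/ceph/ceph-osd.N.log naming, and it only counts lines containing the keyword 'split' (case-insensitively) without assuming anything about the exact log message format.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<count> <logfile>" for each OSD log in a directory,
# where <count> is the number of lines containing "split" (any case).
# The directory defaults to /var/log/ceph; pass another path to inspect copies.
count_split_lines() {
    local log_dir="${1:-/var/log/ceph}"
    local log
    for log in "$log_dir"/ceph-osd.*.log; do
        # If nothing matched, the glob stays literal: skip it.
        [ -e "$log" ] || continue
        printf '%s %s\n' "$(grep -c -i 'split' "$log")" "$log"
    done
}
```

Running count_split_lines with no argument on each OSD node while writes are in progress should show whether any OSD logged a split.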

Actually, splitting should not happen now that you've increased the filestore_merge_threshold and filestore_split_multiple values.

I've also run your script across the cluster nodes, results as follows

id=3, pool=volumes, objects=10454548, avg=160.28
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=35.2344
id=3, pool=volumes, objects=10454548, avg=159.22
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=35.9994
id=3, pool=volumes, objects=10454548, avg=159.843
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=34.7435
id=3, pool=volumes, objects=10454548, avg=159.695
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=35.0579
id=3, pool=volumes, objects=10454548, avg=160.594
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=34.7757
id=3, pool=volumes, objects=10454548, avg=160.099
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=33.8517
id=3, pool=volumes, objects=10454548, avg=159.912
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=37.5698
id=3, pool=volumes, objects=10454548, avg=159.407
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=35.4991
id=3, pool=volumes, objects=10454548, avg=160.075
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=35.481

Looks like there is nothing to be handled by split, am I right? But what about merging? Avg is less than 40, should the directory structure be reduced now?

It should, I guess. But then you'd see blocked requests on every object deletion. If you do, you might want to set filestore_merge_threshold to -40 (negative value) so merging does not happen anymore.
Splitting would still happen over 5120 files per subdirectory.
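
To compare the per-node results above with the two thresholds, a small summary like this could be used. It is only a sketch: summarise_avgs is a hypothetical name, the input is assumed to be the script's output lines (id=..., pool=..., objects=..., avg=...), and 40 and 5120 are the merge and split values discussed in this thread. Note that it averages values that are already averages, so individual directories can still sit on the other side of a threshold.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "id=..., pool=..., objects=..., avg=..." lines on stdin,
# compute the mean of avg per pool, and say where it sits relative to the thresholds.
# Arguments: merge threshold (default 40) and split threshold (default 5120).
summarise_avgs() {
    local merge_t="${1:-40}" split_t="${2:-5120}"
    awk -v merge_t="$merge_t" -v split_t="$split_t" -F', ' '
        /^id=/ {
            sub(/^pool=/, "", $2)      # keep only the pool name
            sub(/^avg=/, "", $4)       # keep only the numeric average
            sum[$2] += $4; n[$2]++
        }
        END {
            for (p in sum) {
                mean = sum[p] / n[p]
                state = "between thresholds"
                if (mean < merge_t) state = "below merge threshold"
                if (mean > split_t) state = "above split threshold"
                printf "%s %.2f %s\n", p, mean, state
            }
        }' | LC_ALL=C sort
}
```

Piping the collected lines from all nodes into summarise_avgs prints one line per pool.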

"filestore_merge_threshold": "40",
"filestore_split_multiple": "8",
"filestore_split_rand_factor": "20",

May I ask for a link to documentation where I can read more about the OSD's underlying directory structure?

I'm not aware of any related documentation.

Do you still observe slow or blocked requests now that you've increased filestore_merge_threshold and filestore_split_multiple?



And I just noticed these log entries in /var/log/ceph/ceph-osd.0.log

2018-02-05 11:22:03.346400 7f3cc94fe700  0 -- <> >> <> conn(0xe254cca800 :6818 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 27 vs existing csq=27 existing_state=STATE_STANDBY
2018-02-05 11:22:03.346583 7f3cc94fe700  0 -- <> >> <> conn(0xe254cca800 :6818 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 28 vs existing csq=27 existing_state=STATE_STANDBY

Many thanks!

On Mon, Feb 5, 2018 at 9:56 AM, Frédéric Nass <frederic.n...@univ-lorraine.fr> wrote:


    In addition, starting with Luminous 12.2.1 (RHCS 3), splitting ops
    should be logged with the default debug level setting:
    There's also an RFE for merging to be logged as well as splitting:



    On 02/02/2018 at 17:00, Frédéric Nass wrote:


    Split and merge operations happen during writes only, splitting
    on file creation and merging on file deletion.

    As you don't see any blocked requests during reads I would guess
    your issue happens during splitting. Now that you increased
    filestore_merge_threshold and filestore_split_multiple, you
    shouldn't expect any splitting operations to happen anytime soon, nor
    any merging operations, unless your workload consists of writing
    a huge number of files and removing them.

    You should check how many files are in each of the lowest-level
    directories of pool 20's PGs. This would help to confirm that the
    blocked requests coincide with the splitting.

    We now use the below script (on one of the OSD nodes) to get an
    average value of the number of files in some PGs of each pool and
    run this script every 5 minutes with Munin to get a graph out of
    the values.
    This way, we can anticipate the splitting and even provoke the
    splitting during off hours with rados bench (prefix) :


    # For each pool, print the average number of files per non-empty
    # directory, sampled from up to 5 PG directories on this OSD node.
    for pool in $(ceph osd pool ls detail | grep -v -E 'snap|^$' | cut -d " " -f2-3 | sed -e 's/ /|/g') ; do
        pool_id=$(echo $pool | cut -d "|" -f1)
        pool_name=$(echo $pool | cut -d "|" -f2 | sed -e "s/'//g")
        nbr_objects=$(rados df | grep "$pool_name " | awk '{print $3}')
        #echo "$pool_id, $pool_name |$nbr_objects|"

        # Skip pools that have no PG directory on this node's OSDs
        if ls -d /var/lib/ceph/osd/ceph-*/current/$pool_id.*_head > /dev/null 2>&1; then
            avg=$(for pg_dir in $(ls -d /var/lib/ceph/osd/ceph-*/current/$pool_id.*_head | tail -5) ; do
                    # Count the files directly inside each directory of the PG
                    find $pg_dir -type d -print0 | while read -d '' -r dir; do
                        files=$(find "$dir" -maxdepth 1 -type f | wc -l)
                        printf "%s in %s\n" "$files" "$dir"
                    done | grep -v '^0 ' | awk '{ sum += $1; n++ } END { if (n > 0) print sum / n; }'
                  done | awk '{ sum += $1; n++ } END { if (n > 0) print sum / n; }')

            echo "id=$pool_id, pool=$pool_name, objects=$nbr_objects, avg=$avg"
        fi
    done


    With 40/8 values, subdirectory splitting will happen when the
    number of files goes over 5120 and merging when the number of
    files goes below 40. My assumption is that merging can also
    happen to intermediate directories when they go below 40 files,
    but I'm not sure about that.
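
For reference, the 5120 figure can be reproduced with the arithmetic below. This is a minimal sketch, assuming the base formula from the Ceph filestore documentation as I understand it: filestore_split_multiple * abs(filestore_merge_threshold) * 16. The function name split_threshold is made up for illustration, and the sketch leaves out any random offset that filestore_split_rand_factor may add on top.

```shell
#!/usr/bin/env bash
# Hypothetical helper: base number of files per subdirectory above which
# filestore splits, i.e. split_multiple * abs(merge_threshold) * 16.
# Any random offset from filestore_split_rand_factor is NOT included.
# Arguments: filestore_merge_threshold, filestore_split_multiple.
split_threshold() {
    local merge="$1" multiple="$2"
    # A negative merge threshold disables merging; only its absolute value counts here.
    if [ "$merge" -lt 0 ]; then
        merge=$(( -merge ))
    fi
    echo $(( multiple * merge * 16 ))
}
```

split_threshold 40 8 prints 5120, split_threshold 10 2 (the defaults quoted further down the thread) prints 320, and split_threshold -40 8 still prints 5120, which matches the remark that splitting continues at 5120 files when the merge threshold is set to -40.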

    If you ever need to provoke splitting on a pool, you can do this
    with a rados bench (don't forget the prefix so you can easily
    remove the files afterwards if it turns out that some were left in
    the pool).
    You can also do the splitting offline if you need it:

    systemctl stop ceph-osd@${osd_num}
    ceph-osd -i ${osd_num} --flush-journal
    ceph-objectstore-tool --data-path
    /var/lib/ceph/osd/ceph-${osd_num} --journal-path
    --log-file=/var/log/ceph/objectstore_tool.${osd_num}.log --op
    apply-layout-settings --pool ${pool_name}



    On 02/02/2018 at 09:55, Jakub Jaszewski wrote:

    So I have changed merge & split settings to
    filestore_merge_threshold = 40
    filestore_split_multiple = 8

    and restarted all OSDs, host by host.

    Let me ask a question: although the pool
    default.rgw.buckets.data, which was affected prior to the above
    change, now has higher write bandwidth, it is very random. Writes
    are now random for the other pools too (both EC and replicated
    types); before the change, writes to replicated pools were much
    more stable. Reads from pools look fine and stable.

    Is it the result of the mentioned change? Is the PG directory
    structure updating, or ...?



    On Thu, Feb 1, 2018 at 3:33 PM, Jakub Jaszewski
    <jaszewski.ja...@gmail.com> wrote:

        Regarding split & merge, I have default values
        filestore_merge_threshold = 10
        filestore_split_multiple = 2

        according to
        <https://bugzilla.redhat.com/show_bug.cgi?id=1219974> the
        recommended values are

        filestore_merge_threshold = 40
        filestore_split_multiple = 8

        Is it something that I can easily change back to the defaults,
        or to lower values than proposed, in case of further
        performance degradation?

        I did tests of 4 pools: 2 replicated pools (x3 ) and 2 EC 
        pools (k=6,m=3)

        The pool with the lowest bandwidth has osd tree structure like
        ├── 20.115s1_head
        │   └── DIR_5
        │       └── DIR_1
        │           ├── DIR_1
        │           │   ├── DIR_0
        │           │   ├── DIR_1
        │           │   ├── DIR_2
        │           │   │   ├── DIR_0
        │           │   │   ├── DIR_1
        │           │   │   ├── DIR_2
        │           │   │   ├── DIR_3
        │           │   │   ├── DIR_4
        │           │   │   ├── DIR_5
        │           │   │   ├── DIR_6
        │           │   │   ├── DIR_7
        │           │   │   ├── DIR_8
        │           │   │   ├── DIR_9
        │           │   │   ├── DIR_A
        │           │   │   ├── DIR_B
        │           │   │   ├── DIR_C
        │           │   │   ├── DIR_D
        │           │   │   ├── DIR_E
        │           │   │   └── DIR_F

        Tests results

        # rados bench -p default.rgw.buckets.data 10 write
        hints = 1
        Maintaining 16 concurrent writes of 4194432 bytes to objects
        of size 4194432 for up to 10 seconds or 0 objects
        Object prefix: benchmark_data_sg08-09_180679
          sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
            0       0         0         0         0         0           -           0
            1      16       129       113   451.975   452.014   0.0376714
            2      16       209       193   385.964    320.01    0.119609
            3      16       235       219   291.974   104.003   0.0337624
            4      16       235       219   218.981         0           -     0.13731
            5      16       266       250   199.983   62.0019    0.111673
            6      16       317       301   200.649   204.006   0.0340569
            7      16       396       380   217.124    316.01   0.0379956
            8      16       444       428   213.981   192.006   0.0304383
            9      16       485       469   208.426   164.005    0.391956
           10      16       496       480   191.983   44.0013    0.104497
           11      16       497       481   174.894   4.00012    0.999985
           12      16       497       481    160.32         0           -
           13      16       497       481   147.987         0           -
           14      16       497       481   137.417         0           -
        Total time run:         14.493353
        Total writes made:      497
        Write size:             4194432
        Object size:            4194432
        Bandwidth (MB/sec):     137.171
        Stddev Bandwidth:       147.001
        Max bandwidth (MB/sec): 452.014
        Min bandwidth (MB/sec): 0
        Average IOPS:           34
        Stddev IOPS:            36
        Max IOPS:               113
        Min IOPS:               0
        Average Latency(s):     0.464281
        Stddev Latency(s):      1.09388
        Max latency(s):         6.3723
        Min latency(s):         0.023835
        Cleaning up (deleting benchmark objects)
        Removed 497 objects
        Clean up completed and total clean up time :10.622382

        # rados bench -p benchmark_erasure_coded 10 write
        hints = 1
        Maintaining 16 concurrent writes of 4202496 bytes to objects
        of size 4202496 for up to 10 seconds or 0 objects
        Object prefix: benchmark_data_sg08-09_180807
          sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
            0       0         0         0         0         0           -           0
            1      16       424       408   1635.11   1635.19   0.0490434
            2      16       828       812   1627.03   1619.16   0.0616501
            3      16      1258      1242   1659.06   1723.36   0.0304412
            4      16      1659      1643   1646.03   1607.13   0.0155402
            5      16      2053      2037   1632.61   1579.08   0.0453354
            6      16      2455      2439      1629   1611.14   0.0485313
            7      16      2649      2633   1507.34   777.516   0.0148972
            8      16      2858      2842   1423.61   837.633   0.0157639
            9      16      3245      3229   1437.75   1551.02   0.0200845
           10      16      3629      3613   1447.85      1539   0.0654451
        Total time run:         10.229591
        Total writes made:      3630
        Write size:             4202496
        Object size:            4202496
        Bandwidth (MB/sec):     1422.18
        Stddev Bandwidth:       341.609
        Max bandwidth (MB/sec): 1723.36
        Min bandwidth (MB/sec): 777.516
        Average IOPS:           354
        Stddev IOPS:            85
        Max IOPS:               430
        Min IOPS:               194
        Average Latency(s):     0.0448612
        Stddev Latency(s):      0.0712224
        Max latency(s):         1.08353
        Min latency(s):         0.0134629
        Cleaning up (deleting benchmark objects)
        Removed 3630 objects
        Clean up completed and total clean up time :2.321669

        # rados bench -p volumes 10 write
        hints = 1
        Maintaining 16 concurrent writes of 4194304 bytes to objects
        of size 4194304 for up to 10 seconds or 0 objects
        Object prefix: benchmark_data_sg08-09_180651
          sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
            0       0         0         0         0         0           -           0
            1      16       336       320   1279.89      1280   0.0309006
            2      16       653       637   1273.84      1268   0.0465151
            3      16       956       940   1253.17      1212   0.0337327
            4      16      1256      1240   1239.85      1200   0.0177263
            5      16      1555      1539   1231.05      1196   0.0364991
            6      16      1868      1852   1234.51      1252   0.0260964
            7      16      2211      2195   1254.13      1372    0.040738
            8      16      2493      2477   1238.35      1128   0.0228582
            9      16      2838      2822   1254.07      1380   0.0265224
           10      16      3116      3100   1239.85      1112   0.0160151
        Total time run:         10.192091
        Total writes made:      3117
        Write size:             4194304
        Object size:            4194304
        Bandwidth (MB/sec):     1223.3
        Stddev Bandwidth:       89.9383
        Max bandwidth (MB/sec): 1380
        Min bandwidth (MB/sec): 1112
        Average IOPS:           305
        Stddev IOPS:            22
        Max IOPS:               345
        Min IOPS:               278
        Average Latency(s):     0.0518144
        Stddev Latency(s):      0.0529575
        Max latency(s):         0.663523
        Min latency(s):         0.0122169
        Cleaning up (deleting benchmark objects)
        Removed 3117 objects
        Clean up completed and total clean up time :0.212296

        # rados bench -p benchmark_replicated  10 write
        hints = 1
        Maintaining 16 concurrent writes of 4194304 bytes to objects
        of size 4194304 for up to 10 seconds or 0 objects
        Object prefix: benchmark_data_sg08-09_180779
          sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
            0       0         0         0         0         0           -           0
            1      16       309       293   1171.94      1172   0.0233267
            2      16       632       616   1231.87      1292   0.0258237
            3      16       959       943   1257.19      1308   0.0335615
            4      16      1276      1260   1259.85      1268    0.031461
            5      16      1643      1627   1301.44      1468   0.0274032
            6      16      1991      1975   1316.51      1392   0.0408116
            7      16      2328      2312   1320.98      1348   0.0242298
            8      16      2677      2661   1330.33      1396    0.097513
            9      16      3042      3026   1344.72      1460   0.0196724
           10      16      3384      3368   1347.03      1368   0.0426199
        Total time run:         10.482871
        Total writes made:      3384
        Write size:             4194304
        Object size:            4194304
        Bandwidth (MB/sec):     1291.25
        Stddev Bandwidth:       90.4861
        Max bandwidth (MB/sec): 1468
        Min bandwidth (MB/sec): 1172
        Average IOPS:           322
        Stddev IOPS:            22
        Max IOPS:               367
        Min IOPS:               293
        Average Latency(s):     0.048763
        Stddev Latency(s):      0.0547666
        Max latency(s):         0.938211
        Min latency(s):         0.0121556
        Cleaning up (deleting benchmark objects)
        Removed 3384 objects
        Clean up completed and total clean up time :0.239684

        Luis, why did you advise against increasing pg_num and pgp_num?
        I'm wondering which option is better: increasing pg_num, or
        increasing filestore_merge_threshold and filestore_split_multiple?


        On Thu, Feb 1, 2018 at 9:38 AM, Jaroslaw Owsiewski
        <jaroslaw.owsiew...@allegro.pl> wrote:


            maybe "split  is on the floor"?

-- Jarek

    ceph-users mailing list
    ceph-users@lists.ceph.com
