Hi Jakub,
Le 05/02/2018 à 12:26, Jakub Jaszewski a écrit :
Hi Frederic,
Many thanks for your contribution to the topic!
I've just set logging level 20 for filestore via
ceph tell osd.0 config set debug_filestore 20
but so far
found
nothing by keyword 'split'
in /var/log/ceph/ceph-osd.0.log
So, if you're running ceph > 12.2.1, that means splitting is not
happening. Did you check during writes ? Did you check other OSDs logs ?
Actually, splitting should not happen now that you've increased
filestore_merge_threshold and filestore_split_multiple values.
I've also run your script across the cluster nodes, results as follows
id=3, pool=volumes, objects=10454548, avg=160.28
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=35.2344
id=3, pool=volumes, objects=10454548, avg=159.22
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=35.9994
id=3, pool=volumes, objects=10454548, avg=159.843
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=34.7435
id=3, pool=volumes, objects=10454548, avg=159.695
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=35.0579
id=3, pool=volumes, objects=10454548, avg=160.594
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=34.7757
id=3, pool=volumes, objects=10454548, avg=160.099
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=33.8517
id=3, pool=volumes, objects=10454548, avg=159.912
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=37.5698
id=3, pool=volumes, objects=10454548, avg=159.407
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=35.4991
id=3, pool=volumes, objects=10454548, avg=160.075
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=35.481
Looks like there is nothing to be handled by split, am I right? But
what about merging ? Avg is less than 40
, should directories structure be reduced now?
It should, I guess. But then you'd see blocked requests on every object
deletion. If you do, you might want to set filestore_merge_threshold
to -40 (negative value) so merging does not happen anymore.
Splitting would still happen over 5120 files per subdirectory.
"
filestore_merge_threshold": "40",
"filestore_split_multiple": "8",
"filestore_split_rand_factor": "20",
M
ay I ask for the link to documentation where I can read more about
OSD underlying directory structure?
I'm not aware of any related documentation.
Do you still observe slow or blocked requests now that you've increased
filestore_merge_threshold and filestore_split_multiple ?
Regards,
Frédéric.
And just noticed log entries in /var/log/ceph/ceph-osd.0.log
2018-02-05 11:22:03.346400 7f3cc94fe700 0 -- 10.212.14.11:6818/4702
<http://10.212.14.11:6818/4702> >> 10.212.14.17:6802/82845
<http://10.212.14.17:6802/82845> conn(0xe254cca800 :6818
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
l=0).handle_connect_msg accept connect_seq 27 vs existing csq=27
existing_state=STATE_STANDBY
2018-02-05 11:22:03.346583 7f3cc94fe700 0 -- 10.212.14.11:6818/4702
<http://10.212.14.11:6818/4702> >> 10.212.14.17:6802/82845
<http://10.212.14.17:6802/82845> conn(0xe254cca800 :6818
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
l=0).handle_connect_msg accept connect_seq 28 vs existing csq=27
existing_state=STATE_STANDBY
M
any thanks!
Jakub
On Mon, Feb 5, 2018 at 9:56 AM, Frédéric Nass
<frederic.n...@univ-lorraine.fr
<mailto:frederic.n...@univ-lorraine.fr>> wrote:
Hi,
In addition, starting with Luminous 12.2.1 (RHCS 3), splitting ops
should be loggued with default setting of debug level messages:
https://github.com/ceph/ceph/blob/v12.2.1/src/os/filestore/HashIndex.cc#L320
<https://github.com/ceph/ceph/blob/v12.2.1/src/os/filestore/HashIndex.cc#L320>
There's also a RFE for merging to be loggued as well as splitting:
https://bugzilla.redhat.com/show_bug.cgi?id=1523532
<https://bugzilla.redhat.com/show_bug.cgi?id=1523532>
Regards,
Frédéric.
Le 02/02/2018 à 17:00, Frédéric Nass a écrit :
Hi,
Split and merge operations happen during writes only, splitting
on file creation and merging on file deletion.
As you don't see any blocked requests during reads I would guess
your issue happens during splitting. Now that you increased
filestore_merge_threshold and filestore_split_multiple, you
shouldn't expect any splitting operations to happen any soon, nor
any merging operations, unless your workload consists of writing
a huge number of files and removing them.
You should check how many files are in each lower directories of
pool 20's PGs. This would help to confirm that the blocked
requests come with the splitting.
We now use the below script (on one of the OSD nodes) to get an
average value of the number of files in some PGs of each pool and
run this script every 5 minutes with Munin to get a graph out of
the values.
This way, we can anticipate the splitting and even provoke the
splitting during off hours with rados bench (prefix) :
#!/bin/bash
for pool in $(ceph osd pool ls detail | grep -v -E 'snap|^$' |
cut -d " " -f2-3 | sed -e 's/ /|/g') ; do
pool_id=$(echo $pool | cut -d "|" -f1)
pool_name=$(echo $pool | cut -d "|" -f2 | sed -e "s/'//g")
nbr_objects=$(rados df | grep "$pool_name " | awk '{print $3}')
#echo "$pool_id, $pool_name |$nbr_objects|"
ls -d /var/lib/ceph/osd/ceph-*/current/$pool_id.*_head >
/dev/null 2>&1
#echo $?
#continue
if [ $? -eq 0 ]; then
avg=$(for pg_dir in $(ls -d
/var/lib/ceph/osd/ceph-*/current/$pool_id.*_head | tail -5) ; do
find $pg_dir -type d -print0 | while read -d '' -r dir; do
files=`find $dir -type f -maxdepth 1|wc -l`; printf "$files in
$dir\n";done | grep -v '^0 ' | awk '{ sum += $1; n++ } END { if
(n > 0) print sum / n; }' ; done | awk '{ sum += $1; n++ } END {
if (n > 0) print sum / n; }')
else
avg=0
fi
echo "id=$pool_id, pool=$pool_name, objects=$nbr_objects, avg=$avg"
done
With 40/8 values, subdirectory splitting will happen when the
number of files goes over 5120 and merging when the number of
files goes below 40. My assumption is that merging can also
happen to intermediate directories when then go below 40 files,
but I'm not sure about that.
If you ever need to provoke splitting on a pool, you can do this
with a rados bench (don't forget the prefix so you can easily
remove the files aftewards if it turns out that some were left in
the pool).
You can also do the splitting offline if you need it:
systemctl stop ceph-osd@${osd_num}
ceph-osd -i ${osd_num} --flush-journal
ceph-objectstore-tool --data-path
/var/lib/ceph/osd/ceph-${osd_num} --journal-path
/var/lib/ceph/osd/ceph-${osd_num}/journal
--log-file=/var/log/ceph/objectstore_tool.${osd_num}.log --op
apply-layout-settings --pool ${pool_name}
Regards,
Frederic.
Le 02/02/2018 à 09:55, Jakub Jaszewski a écrit :
Hi,
So I have changed merge & split settings to
filestore_merge_threshold = 40
filestore_split_multiple = 8
and restart all OSDs , host by host.
Let me ask a question, although the pool
default.rgw.buckets.data that was affected prior to the above
change has higher write bandwidth it is very random now. Writes
are random for other pools (same for EC and replicated types)
too, before the change writes to replicated pools were much more
stable.
Reads from pools look fine and stable.
Is it the result of mentioned change ? Is PG directory structure
updating or ...?
[...]
Regards
Jakub
On Thu, Feb 1, 2018 at 3:33 PM, Jakub Jaszewski
<jaszewski.ja...@gmail.com <mailto:jaszewski.ja...@gmail.com>>
wrote:
Regarding split & merge, I have default values
filestore_merge_threshold = 10
filestore_split_multiple = 2
according to
https://bugzilla.redhat.com/show_bug.cgi?id=1219974
<https://bugzilla.redhat.com/show_bug.cgi?id=1219974> the
recommended values are
filestore_merge_threshold = 40
filestore_split_multiple = 8
Is it something that I can easily change to default or lower
values than proposed in case of further performance
degradation ?
I did tests of 4 pools: 2 replicated pools (x3 ) and 2 EC
pools (k=6,m=3)
The pool with the lowest bandwidth has osd tree structure like
├── 20.115s1_head
│ └── DIR_5
│ └── DIR_1
│ ├── DIR_1
│ │ ├── DIR_0
│ │ ├── DIR_1
│ │ ├── DIR_2
│ │ │ ├── DIR_0
│ │ │ ├── DIR_1
│ │ │ ├── DIR_2
│ │ │ ├── DIR_3
│ │ │ ├── DIR_4
│ │ │ ├── DIR_5
│ │ │ ├── DIR_6
│ │ │ ├── DIR_7
│ │ │ ├── DIR_8
│ │ │ ├── DIR_9
│ │ │ ├── DIR_A
│ │ │ ├── DIR_B
│ │ │ ├── DIR_C
│ │ │ ├── DIR_D
│ │ │ ├── DIR_E
│ │ │ └── DIR_F
Tests results
# rados bench -p default.rgw.buckets.data 10 write
hints = 1
Maintaining 16 concurrent writes of 4194432 bytes to objects
of size 4194432 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_sg08-09_180679
sec Cur ops started finished avg MB/s cur MB/s last
lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 129 113 451.975 452.014 0.0376714
0.128027
2 16 209 193 385.964 320.01 0.119609
0.138517
3 16 235 219 291.974 104.003 0.0337624
0.13731
4 16 235 219 218.981 0 - 0.13731
5 16 266 250 199.983 62.0019 0.111673
0.238424
6 16 317 301 200.649 204.006 0.0340569
0.298489
7 16 396 380 217.124 316.01 0.0379956
0.283458
8 16 444 428 213.981 192.006 0.0304383
0.274193
9 16 485 469 208.426 164.005 0.391956
0.283421
10 16 496 480 191.983 44.0013 0.104497
0.292074
11 16 497 481 174.894 4.00012 0.999985
0.293545
12 16 497 481 160.32 0 -
0.293545
13 16 497 481 147.987 0 -
0.293545
14 16 497 481 137.417 0 -
0.293545
Total time run: 14.493353
Total writes made: 497
Write size: 4194432
Object size: 4194432
Bandwidth (MB/sec): 137.171
Stddev Bandwidth: 147.001
Max bandwidth (MB/sec): 452.014
Min bandwidth (MB/sec): 0
Average IOPS: 34
Stddev IOPS: 36
Max IOPS: 113
Min IOPS: 0
Average Latency(s): 0.464281
Stddev Latency(s): 1.09388
Max latency(s): 6.3723
Min latency(s): 0.023835
Cleaning up (deleting benchmark objects)
Removed 497 objects
Clean up completed and total clean up time :10.622382
#
# rados bench -p benchmark_erasure_coded 10 write
hints = 1
Maintaining 16 concurrent writes of 4202496 bytes to objects
of size 4202496 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_sg08-09_180807
sec Cur ops started finished avg MB/s cur MB/s last
lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 424 408 1635.11 1635.19 0.0490434
0.0379616
2 16 828 812 1627.03 1619.16 0.0616501
0.0388467
3 16 1258 1242 1659.06 1723.36 0.0304412
0.0384537
4 16 1659 1643 1646.03 1607.13 0.0155402
0.0387351
5 16 2053 2037 1632.61 1579.08 0.0453354
0.0390236
6 16 2455 2439 1629 1611.14 0.0485313
0.0392376
7 16 2649 2633 1507.34 777.516 0.0148972
0.0393161
8 16 2858 2842 1423.61 837.633 0.0157639
0.0449088
9 16 3245 3229 1437.75 1551.02 0.0200845
0.0444847
10 16 3629 3613 1447.85 1539 0.0654451
0.0441569
Total time run: 10.229591
Total writes made: 3630
Write size: 4202496
Object size: 4202496
Bandwidth (MB/sec): 1422.18
Stddev Bandwidth: 341.609
Max bandwidth (MB/sec): 1723.36
Min bandwidth (MB/sec): 777.516
Average IOPS: 354
Stddev IOPS: 85
Max IOPS: 430
Min IOPS: 194
Average Latency(s): 0.0448612
Stddev Latency(s): 0.0712224
Max latency(s): 1.08353
Min latency(s): 0.0134629
Cleaning up (deleting benchmark objects)
Removed 3630 objects
Clean up completed and total clean up time :2.321669
#
# rados bench -p volumes 10 write
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects
of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_sg08-09_180651
sec Cur ops started finished avg MB/s cur MB/s last
lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 336 320 1279.89 1280 0.0309006
0.0472524
2 16 653 637 1273.84 1268 0.0465151
0.0495701
3 16 956 940 1253.17 1212 0.0337327
0.0504146
4 16 1256 1240 1239.85 1200 0.0177263
0.0509145
5 16 1555 1539 1231.05 1196 0.0364991
0.0516724
6 16 1868 1852 1234.51 1252 0.0260964
0.0510236
7 16 2211 2195 1254.13 1372 0.040738
0.050847
8 16 2493 2477 1238.35 1128 0.0228582
0.0514979
9 16 2838 2822 1254.07 1380 0.0265224
0.0508641
10 16 3116 3100 1239.85 1112 0.0160151
0.0513104
Total time run: 10.192091
Total writes made: 3117
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 1223.3
Stddev Bandwidth: 89.9383
Max bandwidth (MB/sec): 1380
Min bandwidth (MB/sec): 1112
Average IOPS: 305
Stddev IOPS: 22
Max IOPS: 345
Min IOPS: 278
Average Latency(s): 0.0518144
Stddev Latency(s): 0.0529575
Max latency(s): 0.663523
Min latency(s): 0.0122169
Cleaning up (deleting benchmark objects)
Removed 3117 objects
Clean up completed and total clean up time :0.212296
#
# rados bench -p benchmark_replicated 10 write
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects
of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_sg08-09_180779
sec Cur ops started finished avg MB/s cur MB/s last
lat(s) avg lat(s)
0 0 0 0 0 0 - 0
1 16 309 293 1171.94 1172 0.0233267
0.0508877
2 16 632 616 1231.87 1292 0.0258237
0.049612
3 16 959 943 1257.19 1308 0.0335615
0.0483464
4 16 1276 1260 1259.85 1268 0.031461
0.0504689
5 16 1643 1627 1301.44 1468 0.0274032
0.0489651
6 16 1991 1975 1316.51 1392 0.0408116
0.0483596
7 16 2328 2312 1320.98 1348 0.0242298
0.048175
8 16 2677 2661 1330.33 1396 0.097513
0.047962
9 16 3042 3026 1344.72 1460 0.0196724
0.0474078
10 16 3384 3368 1347.03 1368 0.0426199
0.0472573
Total time run: 10.482871
Total writes made: 3384
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 1291.25
Stddev Bandwidth: 90.4861
Max bandwidth (MB/sec): 1468
Min bandwidth (MB/sec): 1172
Average IOPS: 322
Stddev IOPS: 22
Max IOPS: 367
Min IOPS: 293
Average Latency(s): 0.048763
Stddev Latency(s): 0.0547666
Max latency(s): 0.938211
Min latency(s): 0.0121556
Cleaning up (deleting benchmark objects)
Removed 3384 objects
Clean up completed and total clean up time :0.239684
#
Luis why did you advise against increasing pg_num pgp_num ?
I'm wondering which option is better: increasing pg_num or
filestore_merge_threshold and filestore_split_multiple ?
Thanks
Jakub
On Thu, Feb 1, 2018 at 9:38 AM, Jaroslaw Owsiewski
<jaroslaw.owsiew...@allegro.pl
<mailto:jaroslaw.owsiew...@allegro.pl>> wrote:
Hi,
maybe "split is on the floor"?
Regards
--
Jarek
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
<http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
<http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com