Thank you, Mark. Please see my responses inline below.
On Wed, Sep 2, 2015 at 5:23 PM, Mark Nelson <[email protected]> wrote:
> On 09/02/2015 08:51 AM, Vickey Singh wrote:
>
>> Hello Ceph Experts
>>
>> I have a strange problem: when I read from or write to a Ceph pool, the
>> throughput is very unstable. Please notice the cur MB/s column, which
>> keeps going up and down.
>>
>> --- Ceph Hammer 0.94.2
>> -- CentOS 6, kernel 2.6
>> -- Ceph cluster is healthy
>>
>
> You might find that CentOS 7 gives you better performance. In some cases
> we were seeing nearly 2X.
Wow, 2X! I will definitely plan for an upgrade. Thanks.
>
>
>
>>
>> One interesting thing is that whenever I start a rados bench command for
>> read or write, CPU idle % drops to ~10 and the system load climbs
>> dramatically.
>>
>> Hardware
>>
>> HpSL4540
>>
>
> Please make sure the controller is on the newest firmware. There used to
> be a bug that would cause sequential write performance to bottleneck when
> writeback cache was enabled on the RAID controller.
Last month I upgraded the firmware on this hardware, so it should be up to
date.
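For reference, this is roughly how I would re-check the controller cache
settings (a sketch: it assumes the HP Smart Storage Administrator CLI,
hpssacli, is installed -- older setups ship hpacucli instead -- and slot=0
is just a placeholder for the actual controller slot):

# hpssacli ctrl all show status                      # controller and cache module status
# hpssacli ctrl slot=0 show detail | grep -i cache   # writeback cache configuration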
>
>
>> 32-core CPU
>> 196G Memory
>> 10G Network
>>
>
> Be sure to check the network too. We've seen a lot of cases where folks
> have been burned by one of the NICs acting funky.
>
At first glance the interfaces look good, and they are pushing whatever
data they are given without trouble.
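For completeness, this is roughly how I checked them (a sketch: eth2 stands
in for the actual 10G interface name, s02 for a peer node, and it assumes
iperf3 is installed, e.g. from EPEL):

# ip -s link show eth2                     # per-interface error/drop counters
# ethtool -S eth2 | grep -iE 'err|drop'    # NIC driver statistics
# iperf3 -s                                # on s02
# iperf3 -c s02 -t 30                      # from s01: verify raw ~10Gbit/s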
>
>
>> I don't think hardware is a problem.
>>
>> Please give me clues / pointers on how I should troubleshoot this problem.
>>
>>
>>
>> # rados bench -p glance-test 60 write
>> Maintaining 16 concurrent writes of 4194304 bytes for up to 60 seconds
>> or 0 objects
>> Object prefix: benchmark_data_pouta-s01.pouta.csc.fi_2173350
>>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>>     0       0         0         0         0         0         -         0
>>     1      16        20         4     15.99        16   0.12308   0.10001
>>     2      16        37        21   41.9841        68   1.79104  0.827021
>>     3      16        68        52   69.3122       124  0.084304  0.854829
>>     4      16       114        98   97.9746       184   0.12285  0.614507
>>     5      16       188       172   137.568       296  0.210669  0.449784
>>     6      16       248       232   154.634       240  0.090418  0.390647
>>     7      16       305       289    165.11       228  0.069769  0.347957
>>     8      16       331       315   157.471       104  0.026247    0.3345
>>     9      16       361       345   153.306       120  0.082861  0.320711
>>    10      16       380       364   145.575        76  0.027964  0.310004
>>    11      16       393       377   137.067        52   3.73332  0.393318
>>    12      16       448       432   143.971       220  0.334664  0.415606
>>    13      16       476       460   141.508       112  0.271096  0.406574
>>    14      16       497       481   137.399        84  0.257794  0.412006
>>    15      16       507       491   130.906        40   1.49351  0.428057
>>    16      16       529       513   115.042        88  0.399384   0.48009
>>    17      16       533       517   94.6286        16   5.50641  0.507804
>>    18      16       537       521    83.405        16   4.42682  0.549951
>>    19      16       538       522    80.349         4   11.2052  0.570363
>> 2015-09-02 09:26:18.398641 min lat: 0.023851 max lat: 11.2052 avg lat: 0.570363
>>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>>    20      16       538       522   77.3611         0         -  0.570363
>>    21      16       540       524   74.8825         4   8.88847  0.591767
>>    22      16       542       526   72.5748         8   1.41627  0.593555
>>    23      16       543       527   70.2873         4    8.0856  0.607771
>>    24      16       555       539   69.5674        48  0.145199  0.781685
>>    25      16       560       544   68.0177        20    1.4342  0.787017
>>    26      16       564       548   66.4241        16  0.451905   0.78765
>>    27      16       566       550   64.7055         8  0.611129  0.787898
>>    28      16       570       554   63.3138        16   2.51086  0.797067
>>    29      16       570       554   61.5549         0         -  0.797067
>>    30      16       572       556   60.1071         4   7.71382  0.830697
>>    31      16       577       561   59.0515        20   23.3501  0.916368
>>    32      16       590       574   58.8705        52  0.336684  0.956958
>>    33      16       591       575   57.4986         4   1.92811  0.958647
>>    34      16       591       575   56.0961         0         -  0.958647
>>    35      16       591       575   54.7603         0         -  0.958647
>>    36      16       597       581   54.0447         8  0.187351   1.00313
>>    37      16       625       609   52.8394       112   2.12256   1.09256
>>    38      16       631       615    52.227        24   1.57413   1.10206
>>    39      16       638       622   51.7232        28   4.41663   1.15086
>> 2015-09-02 09:26:40.510623 min lat: 0.023851 max lat: 27.6704 avg lat: 1.15657
>>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>>    40      16       652       636   51.8102        56  0.113345   1.15657
>>    41      16       682       666   53.1443       120  0.041251   1.17813
>>    42      16       685       669   52.3395        12  0.501285   1.17421
>>    43      15       690       675   51.7955        24   2.26605   1.18357
>>    44      16       728       712   53.6062       148  0.589826   1.17478
>>    45      16       728       712   52.6158         0         -   1.17478
>>    46      16       728       712   51.6613         0         -   1.17478
>>    47      16       728       712   50.7407         0         -   1.17478
>>    48      16       772       756   52.9332        44  0.234811    1.1946
>>    49      16       835       819   56.3577       252   5.67087   1.12063
>>    50      16       890       874   59.1252       220  0.230806   1.06778
>>    51      16       896       880   58.5409        24  0.382471   1.06121
>>    52      16       896       880   57.5832         0         -   1.06121
>>    53      16       896       880   56.6562         0         -   1.06121
>>    54      16       896       880   55.7587         0         -   1.06121
>>    55      16       897       881   54.9515         1   4.88333   1.06554
>>    56      16       897       881   54.1077         0         -   1.06554
>>    57      16       897       881   53.2894         0         -   1.06554
>>    58      16       897       881   51.9335         0         -   1.06554
>>    59      16       897       881   51.1792         0         -   1.06554
>> 2015-09-02 09:27:01.267301 min lat: 0.01405 max lat: 27.6704 avg lat: 1.06554
>>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>>    60      16       897       881   50.4445         0         -   1.06554
>>
>>
>>
>>
>>
>> cluster 98d89661-f616-49eb-9ccf-84d720e179c0
>> health HEALTH_OK
>>      monmap e3: 3 mons at
>> {s01=10.100.50.1:6789/0,s02=10.100.50.2:6789/0,s03=10.100.50.3:6789/0},
>> election epoch 666, quorum 0,1,2 s01,s02,s03
>>      osdmap e121039: 240 osds: 240 up, 240 in
>> pgmap v850698: 7232 pgs, 31 pools, 439 GB data, 43090 kobjects
>> 2635 GB used, 867 TB / 870 TB avail
>> 7226 active+clean
>> 6 active+clean+scrubbing+deep
>>
>
> Note the last line there. You'll likely want to try your test again when
> scrubbing is complete. Also, you may want to try this script:
>
Yes, I have tried a few times when the cluster was perfectly healthy (no
scrubbing / repairs in progress).
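Next time I can also rule scrubbing out explicitly by disabling it for the
duration of the benchmark (a sketch; these cluster flags exist in Hammer
and must be unset again afterwards):

# ceph osd set noscrub         # stop scheduling new shallow scrubs
# ceph osd set nodeep-scrub    # stop scheduling new deep scrubs
# rados bench -p glance-test 60 write
# ceph osd unset noscrub
# ceph osd unset nodeep-scrub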
>
> https://github.com/ceph/cbt/blob/master/tools/readpgdump.py
>
> You can invoke it like:
>
> ceph pg dump | ./readpgdump.py
>
> That will give you a bunch of information about the pools on your system.
> I'm a little concerned about how many PGs your glance-test pool may have
> given your totals above.
>
Thanks for the link; I will do that and also run rados bench against the
other pools (where the PG count is higher).
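As a quick first pass I will also just dump pg_num per pool and compare it
against the usual rule of thumb (very roughly 100 PGs per OSD, divided by
the replica count and spread across all pools; a sketch, not a sizing
recipe):

# ceph osd dump | grep pg_num                # pg_num / pgp_num for every pool
# ceph osd pool get glance-test pg_num       # just the benchmark pool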
Now here are some of my observations:
1# When the cluster is doing nothing (HEALTH_OK, no background scrubbing /
repair) and all system resources (CPU / memory / network) are mostly idle,
and I then start rados bench (write / rand / seq), after a few seconds:
--- the rados bench output drops from ~500 MB/s to a few tens of MB/s
--- at the same time, CPU busy hits ~90% and the system load jumps up
Once rados bench completes:
--- after a few minutes the system resources become idle again
2# Sometimes a few PGs become unclean for a few minutes while rados bench
runs, and then they quickly return to active+clean.
I am out of clues, so any help from the community that points me in the
right direction would be appreciated. I will capture some more data on the
next run, as sketched below.
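For the next run I plan to watch the following while the benchmark is
active, to see whether the stall is on the OSD side or the disk side (a
sketch; osd.0 is just an example id, and dump_historic_ops must be run on
the node that hosts that OSD):

# ceph osd perf                           # commit/apply latency per OSD
# iostat -x 1                             # per-disk utilisation and await
# ceph daemon osd.0 dump_historic_ops     # slowest recent ops on one OSD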
- Vickey -
>
>
>>
>>
>>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com