Thank you, Mark. Please see my responses inline below.
On Wed, Sep 2, 2015 at 5:23 PM, Mark Nelson <[email protected]> wrote:
> On 09/02/2015 08:51 AM, Vickey Singh wrote:
>
>> Hello Ceph Experts
>>
>> I have a strange problem: when I read from or write to a Ceph pool, the
>> throughput is very unstable. Please notice the cur MB/s column, which
>> keeps going up and down.
>>
>> --- Ceph Hammer 0.94.2
>> -- CentOS 6, kernel 2.6
>> -- Ceph cluster is healthy
>>
>
> You might find that CentOS 7 gives you better performance. In some cases
> we were seeing nearly 2X.
Wow, 2X! I will definitely plan for an upgrade. Thanks.
>
>
>
>>
>> One interesting thing is that whenever I start a rados bench command for
>> read or write, CPU idle % drops to ~10 and the system load climbs
>> dramatically.
>>
>> Hardware
>>
>> HpSL4540
>>
>
> Please make sure the controller is on the newest firmware. There used to
> be a bug that would cause sequential write performance to bottleneck when
> writeback cache was enabled on the RAID controller.
Last month I upgraded the firmware on this hardware, so it should be up to
date.
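For reference, this is roughly how I would re-check the controller cache
settings (a sketch: it assumes the HP Smart Storage Administrator CLI,
hpssacli, is installed -- older setups ship hpacucli instead -- and slot=0
is just a placeholder for the actual controller slot):

# hpssacli ctrl all show status                      # controller and cache module status
# hpssacli ctrl slot=0 show detail | grep -i cache   # writeback cache configuration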
>
>
>> 32-core CPU
>> 196G Memory
>> 10G Network
>>
>
> Be sure to check the network too. We've seen a lot of cases where folks
> have been burned by one of the NICs acting funky.
>
At first glance the interfaces look good, and they are pushing whatever
data they are given without trouble.
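For completeness, this is roughly how I checked them (a sketch: eth2 stands
in for the actual 10G interface name, s02 for a peer node, and it assumes
iperf3 is installed, e.g. from EPEL):

# ip -s link show eth2                     # per-interface error/drop counters
# ethtool -S eth2 | grep -iE 'err|drop'    # NIC driver statistics
# iperf3 -s                                # on s02
# iperf3 -c s02 -t 30                      # from s01: verify raw ~10Gbit/s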
>
>
>> I don't think hardware is a problem.
>>
>> Please give me clues / pointers on how I should troubleshoot this problem.
>>
>>
>>
>> # rados bench -p glance-test 60 write
>> Maintaining 16 concurrent writes of 4194304 bytes for up to 60 seconds
>> or 0 objects
>> Object prefix: benchmark_data_pouta-s01.pouta.csc.fi_2173350
>>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>>     0       0         0         0         0         0         -         0
>>     1      16        20         4     15.99        16   0.12308   0.10001
>>     2      16        37        21   41.9841        68   1.79104  0.827021
>>     3      16        68        52   69.3122       124  0.084304  0.854829
>>     4      16       114        98   97.9746       184   0.12285  0.614507
>>     5      16       188       172   137.568       296  0.210669  0.449784
>>     6      16       248       232   154.634       240  0.090418  0.390647
>>     7      16       305       289    165.11       228  0.069769  0.347957
>>     8      16       331       315   157.471       104  0.026247    0.3345
>>     9      16       361       345   153.306       120  0.082861  0.320711
>>    10      16       380       364   145.575        76  0.027964  0.310004
>>    11      16       393       377   137.067        52   3.73332  0.393318
>>    12      16       448       432   143.971       220  0.334664  0.415606
>>    13      16       476       460   141.508       112  0.271096  0.406574
>>    14      16       497       481   137.399        84  0.257794  0.412006
>>    15      16       507       491   130.906        40   1.49351  0.428057
>>    16      16       529       513   115.042        88  0.399384   0.48009
>>    17      16       533       517   94.6286        16   5.50641  0.507804
>>    18      16       537       521    83.405        16   4.42682  0.549951
>>    19      16       538       522    80.349         4   11.2052  0.570363
>> 2015-09-02 09:26:18.398641 min lat: 0.023851 max lat: 11.2052 avg lat: 0.570363
>>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>>    20      16       538       522   77.3611         0         -  0.570363
>>    21      16       540       524   74.8825         4   8.88847  0.591767
>>    22      16       542       526   72.5748         8   1.41627  0.593555
>>    23      16       543       527   70.2873         4    8.0856  0.607771
>>    24      16       555       539   69.5674        48  0.145199  0.781685
>>    25      16       560       544   68.0177        20    1.4342  0.787017
>>    26      16       564       548   66.4241        16  0.451905   0.78765
>>    27      16       566       550   64.7055         8  0.611129  0.787898
>>    28      16       570       554   63.3138        16   2.51086  0.797067
>>    29      16       570       554   61.5549         0         -  0.797067
>>    30      16       572       556   60.1071         4   7.71382  0.830697
>>    31      16       577       561   59.0515        20   23.3501  0.916368
>>    32      16       590       574   58.8705        52  0.336684  0.956958
>>    33      16       591       575   57.4986         4   1.92811  0.958647
>>    34      16       591       575   56.0961         0         -  0.958647
>>    35      16       591       575   54.7603         0         -  0.958647
>>    36      16       597       581   54.0447         8  0.187351   1.00313
>>    37      16       625       609   52.8394       112   2.12256   1.09256
>>    38      16       631       615    52.227        24   1.57413   1.10206
>>    39      16       638       622   51.7232        28   4.41663   1.15086
>> 2015-09-02 09:26:40.510623 min lat: 0.023851 max lat: 27.6704 avg lat: 1.15657
>>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>>    40      16       652       636   51.8102        56  0.113345   1.15657
>>    41      16       682       666   53.1443       120  0.041251   1.17813
>>    42      16       685       669   52.3395        12  0.501285   1.17421
>>    43      15       690       675   51.7955        24   2.26605   1.18357
>>    44      16       728       712   53.6062       148  0.589826   1.17478
>>    45      16       728       712   52.6158         0         -   1.17478
>>    46      16       728       712   51.6613         0         -   1.17478
>>    47      16       728       712   50.7407         0         -   1.17478
>>    48      16       772       756   52.9332        44  0.234811    1.1946
>>    49      16       835       819   56.3577       252   5.67087   1.12063
>>    50      16       890       874   59.1252       220  0.230806   1.06778
>>    51      16       896       880   58.5409        24  0.382471   1.06121
>>    52      16       896       880   57.5832         0         -   1.06121
>>    53      16       896       880   56.6562         0         -   1.06121
>>    54      16       896       880   55.7587         0         -   1.06121
>>    55      16       897       881   54.9515         1   4.88333   1.06554
>>    56      16       897       881   54.1077         0         -   1.06554
>>    57      16       897       881   53.2894         0         -   1.06554
>>    58      16       897       881   51.9335         0         -   1.06554
>>    59      16       897       881   51.1792         0         -   1.06554
>> 2015-09-02 09:27:01.267301 min lat: 0.01405 max lat: 27.6704 avg lat: 1.06554
>>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>>    60      16       897       881   50.4445         0         -   1.06554
>>
>>
>>
>>
>>
>> cluster 98d89661-f616-49eb-9ccf-84d720e179c0
>> health HEALTH_OK
>>      monmap e3: 3 mons at
>> {s01=10.100.50.1:6789/0,s02=10.100.50.2:6789/0,s03=10.100.50.3:6789/0},
>> election epoch 666, quorum 0,1,2 s01,s02,s03
>>      osdmap e121039: 240 osds: 240 up, 240 in
>> pgmap v850698: 7232 pgs, 31 pools, 439 GB data, 43090 kobjects
>> 2635 GB used, 867 TB / 870 TB avail
>> 7226 active+clean
>> 6 active+clean+scrubbing+deep
>>
>
> Note the last line there. You'll likely want to try your test again when
> scrubbing is complete. Also, you may want to try this script:
>
Yes, I have tried a few times when the cluster was perfectly healthy (no
scrubbing / repairs in progress).
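Next time I can also rule scrubbing out explicitly by disabling it for the
duration of the benchmark (a sketch; these cluster flags exist in Hammer
and must be unset again afterwards):

# ceph osd set noscrub         # stop scheduling new shallow scrubs
# ceph osd set nodeep-scrub    # stop scheduling new deep scrubs
# rados bench -p glance-test 60 write
# ceph osd unset noscrub
# ceph osd unset nodeep-scrub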
>
> https://github.com/ceph/cbt/blob/master/tools/readpgdump.py
>
> You can invoke it like:
>
> ceph pg dump | ./readpgdump.py
>
> That will give you a bunch of information about the pools on your system.
> I'm a little concerned about how many PGs your glance-test pool may have
> given your totals above.
>
Thanks for the link; I will do that and also run rados bench against the
other pools (where the PG count is higher).
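As a quick first pass I will also just dump pg_num per pool and compare it
against the usual rule of thumb (very roughly 100 PGs per OSD, divided by
the replica count and spread across all pools; a sketch, not a sizing
recipe):

# ceph osd dump | grep pg_num                # pg_num / pgp_num for every pool
# ceph osd pool get glance-test pg_num       # just the benchmark pool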
Now here are some of my observations:
1# When the cluster is doing nothing (HEALTH_OK, no background scrubbing /
repair) and all system resources (CPU / memory / network) are mostly idle,
and I then start rados bench (write / rand / seq), after a few seconds:
--- the rados bench output drops from ~500 MB/s to a few tens of MB/s
--- at the same time, CPU busy hits ~90% and the system load jumps up
Once rados bench completes:
--- after a few minutes the system resources become idle again
2# Sometimes a few PGs become unclean for a few minutes while rados bench
runs, and then they quickly return to active+clean.
I am out of clues, so any help from the community that points me in the
right direction would be appreciated. I will capture some more data on the
next run, as sketched below.
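For the next run I plan to watch the following while the benchmark is
active, to see whether the stall is on the OSD side or the disk side (a
sketch; osd.0 is just an example id, and dump_historic_ops must be run on
the node that hosts that OSD):

# ceph osd perf                           # commit/apply latency per OSD
# iostat -x 1                             # per-disk utilisation and await
# ceph daemon osd.0 dump_historic_ops     # slowest recent ops on one OSD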
- Vickey -
>
>
>>
>>
>>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com