I had the same problem when doing benchmarks with small block sizes (<8k) to 
RBDs. These settings seemed to fix the problem for me.

sudo ceph tell osd.* injectargs '--filestore_merge_threshold 40'
sudo ceph tell osd.* injectargs '--filestore_split_multiple 8'

After you apply the settings, give it a few minutes to shuffle the data around.
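
If you want these to survive OSD restarts, they can also go into ceph.conf on the
OSD nodes (a minimal sketch; injectargs alone is not persistent):

[osd]
filestore merge threshold = 40
filestore split multiple = 8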

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Somnath Roy
Sent: Monday, May 11, 2015 3:21 AM
To: Nikola Ciprich
Cc: ceph-users; n...@linuxbox.cz
Subject: Re: [ceph-users] very different performance on two volumes in the same 
pool #2

Nik,
If you increase num_jobs beyond 4, does it help further? Try 8 or so.
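
For example, a variant of the fio command from your first mail (same pool and
image assumed; --group_reporting just aggregates the per-job results):

fio --randrepeat=1 --ioengine=rbd --direct=1 --gtod_reduce=1 --name=test \
    --pool=ssd3r --rbdname=${rbdname} --invalidate=1 --bs=4k --iodepth=64 \
    --numjobs=8 --group_reporting --readwrite=randread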
Yeah, libsoft* is definitely consuming some CPU cycles, but I don't know how to
resolve that.
Also, acpi_processor_ffh_cstate_enter popped up and is consuming a lot of CPU. Try
disabling C-states and running the CPUs in maximum performance mode; this may give
you some boost.
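
A sketch of how that can be done (the exact knobs are platform-dependent, so treat
these as assumptions to verify): limit C-states on the kernel command line, e.g.

intel_idle.max_cstate=0 processor.max_cstate=1

and switch the cpufreq governor to performance:

for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > $g
done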

Thanks & Regards
Somnath

-----Original Message-----
From: Nikola Ciprich [mailto:nikola.cipr...@linuxbox.cz] 
Sent: Sunday, May 10, 2015 11:32 PM
To: Somnath Roy
Cc: ceph-users; n...@linuxbox.cz
Subject: Re: [ceph-users] very different performance on two volumes in the same 
pool #2

On Mon, May 11, 2015 at 06:07:21AM +0000, Somnath Roy wrote:
> Yes, you need to run the fio clients on a separate box; they will take quite a
> bit of CPU.
> If you stop OSDs on other nodes, rebalancing will start. Have you waited for the
> cluster to reach the active+clean state? If you run the benchmark while
> rebalancing is going on, the performance will be impacted.
I set noout, so there was no rebalancing; I forgot to mention that..
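
(i.e. ceph osd set noout before stopping the OSD, and ceph osd unset noout once
it is back)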


> 
> ~110% CPU util seems pretty low. Try running fio-rbd with more num_jobs (say
> 3 or 4 or more); iodepth=64 is fine. See if it improves performance or not.
OK, increasing jobs to 4 seems to squeeze a bit more out of the cluster, about
43.3K IOPS..

OSD CPU util jumps to ~300% on both surviving nodes, so there still seems to be a
bit of reserve left..

> Also, since you have 3 OSDs (3 nodes?), I would suggest tweaking the following
> settings:
> 
> osd_op_num_threads_per_shard
> osd_op_num_shards
> 
> Maybe (1,10 / 1,15 / 2,10?).

I tried all those combinations, but they make almost no difference..
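
(applied at runtime along the lines of:

ceph tell osd.* injectargs '--osd_op_num_shards 10'
ceph tell osd.* injectargs '--osd_op_num_threads_per_shard 1'

though I believe these may need a ceph.conf entry plus an OSD restart to really
take effect)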

Do you think I could get more than those 43K?

One more thing that makes me wonder a bit is this line I can see in perf:

  2.21%  libsoftokn3.so             [.] 0x000000000001ebb2

I suppose this has something to do with resolving; 2.2% seems like quite a lot to
me..
Should I be worried about it? Does it make sense to enable kernel DNS resolving
support in ceph?

Thanks for your time, Somnath!

nik



> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Nikola Ciprich [mailto:nikola.cipr...@linuxbox.cz]
> Sent: Sunday, May 10, 2015 10:33 PM
> To: Somnath Roy
> Cc: ceph-users; n...@linuxbox.cz
> Subject: Re: [ceph-users] very different performance on two volumes in 
> the same pool #2
> 
> 
> On Mon, May 11, 2015 at 05:20:25AM +0000, Somnath Roy wrote:
> > Two things..
> > 
> > 1. You should always use SSD drives for benchmarking only after
> > preconditioning them.
> Well, I don't really understand what you mean here...?
> 
> > 
> > 2. After creating and mapping the rbd lun, you need to write data to it
> > first and read it afterward; otherwise the fio output will be misleading. In
> > fact, I think you will see the IO is not even hitting the cluster (check
> > with ceph -s).
> Yes, so this confirms my conjecture. OK.
> 
> 
> > 
> > Now, if you are saying it's a 3 OSD setup, yes, ~23K is pretty low. Check 
> > the following.
> > 
> > 1. Check whether the client or OSD node CPU is saturating or not.
> On the OSD nodes, I can see ceph-osd CPU utilisation of ~110%. On the client
> node (which is one of the OSD nodes as well), I can see fio eating quite a lot
> of CPU cycles.. I tried stopping ceph-osd on this node (so only two nodes are
> serving data) and performance improved a bit, to ~33K IOPS. But I still think
> it's not very good..
> 
> 
> > 
> > 2. With 4K blocks, I hope the network bandwidth is fine.
> I think it's ok..
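> 
> (a quick sanity check is iperf between the nodes, e.g.:
> 
> iperf3 -s               # on one node
> iperf3 -c <osd-node>    # on another
> 
> 10GbE with jumbo frames should show close to line rate)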
> 
> 
> > 
> > 3. Number of PGs/pool should be ~128 or so.
> I'm using pg_num 128.
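> (this can be double-checked with: ceph osd pool get ssd3r pg_num)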
> 
> 
> > 
> > 4. If you are using krbd, you might want to try the latest krbd module, where
> > the TCP_NODELAY problem is fixed. If you don't want that complexity, try
> > fio-rbd.
> I'm not using krbd (only for writing data to the volume); for benchmarking I'm
> using fio-rbd.
> 
> Anything else I could check?
> 
> 
> > 
> > Hope this helps,
> > 
> > Thanks & Regards
> > Somnath
> > 
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On 
> > Behalf Of Nikola Ciprich
> > Sent: Sunday, May 10, 2015 9:43 PM
> > To: ceph-users
> > Cc: n...@linuxbox.cz
> > Subject: [ceph-users] very different performance on two volumes in 
> > the same pool #2
> > 
> > Hello ceph developers and users,
> > 
> > some time ago I posted a question here regarding very different
> > performance of two volumes in the same pool (backed by SSD drives).
> > 
> > After some examination, I probably got to the root of the problem..
> > 
> > When I create a fresh volume (i.e. rbd create --image-format 2 --size
> > 51200 ssd/test) and run a random-IO fio benchmark:
> > 
> > fio  --randrepeat=1 --ioengine=rbd --direct=1 --gtod_reduce=1 
> > --name=test --pool=ssd3r --rbdname=${rbdname} --invalidate=1 --bs=4k
> > --iodepth=64 --readwrite=randread
> > 
> > I get very nice performance of up to 200K IOPS. However, once the volume has
> > been written to (i.e. when I map it using rbd map and dd the whole volume full
> > of random data) and I repeat the benchmark, random-read performance drops to
> > ~23K IOPS.
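> > 
> > (the fill step roughly looked like this; /dev/rbd0 is an assumption, use
> > whatever device rbd map actually prints:
> > 
> > rbd map ssd/test
> > dd if=/dev/urandom of=/dev/rbd0 bs=4M oflag=direct
> > rbd unmap /dev/rbd0
> > )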
> > 
> > This leads me to the conjecture that, for unwritten (sparse) volumes, a read
> > is just a no-op, simply returning zeroes without really having to read data
> > from physical storage, and thus showing nice performance; but once the volume
> > has been written, performance drops due to the need to physically read the
> > data, right?
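> > 
> > (one way to verify how much of the image is actually allocated is to sum the
> > extent lengths from rbd diff; a sketch, same pool/image as above:
> > 
> > rbd diff ssd/test | awk '{sum += $2} END {print sum/1024/1024 " MB"}'
> > )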
> > 
> > However, I'm a bit unhappy about the performance drop. The pool is backed by
> > 3 SSD drives (each with random-IO performance of 100K IOPS) on three nodes,
> > and the pool size (replica count) is set to 3. The cluster is completely
> > idle; the nodes are quad-core Xeons E3-1220 v3 @ 3.10GHz with 32GB RAM each,
> > CentOS 6, kernel 3.18.12, ceph 0.94.1. I'm using libtcmalloc (I even tried
> > upgrading gperftools-libs to 2.4). The nodes are connected using 10Gb
> > ethernet, with jumbo frames enabled.
> > 
> > 
> > I tried tuning the following values:
> > 
> > osd_op_threads = 5
> > filestore_op_threads = 4
> > osd_op_num_threads_per_shard = 1
> > osd_op_num_shards = 25
> > filestore_fd_cache_size = 64
> > filestore_fd_cache_shards = 32
> > 
> > I don't see anything special in perf:
> > 
> >   5.43%  [kernel]              [k] acpi_processor_ffh_cstate_enter
> >   2.93%  libtcmalloc.so.4.2.6  [.] 0x0000000000017d2c
> >   2.45%  libpthread-2.12.so    [.] pthread_mutex_lock
> >   2.37%  libpthread-2.12.so    [.] pthread_mutex_unlock
> >   2.33%  [kernel]              [k] do_raw_spin_lock
> >   2.00%  libsoftokn3.so        [.] 0x000000000001f455
> >   1.96%  [kernel]              [k] __switch_to
> >   1.32%  [kernel]              [k] __schedule
> >   1.24%  libstdc++.so.6.0.13   [.] std::basic_ostream<char, 
> > std::char_traits<char> >& std::__ostream_insert<char, 
> > std::char_traits<char> >(std::basic_ostream<char, std::char
> >   1.24%  libc-2.12.so          [.] memcpy
> >   1.19%  libtcmalloc.so.4.2.6  [.] operator delete(void*)
> >   1.16%  [kernel]              [k] __d_lookup_rcu
> >   1.09%  libstdc++.so.6.0.13   [.] 0x000000000007d6be
> >   0.93%  libstdc++.so.6.0.13   [.] std::basic_streambuf<char, 
> > std::char_traits<char> >::xsputn(char const*, long)
> >   0.93%  ceph-osd              [.] crush_hash32_3
> >   0.85%  libc-2.12.so          [.] vfprintf
> >   0.84%  libc-2.12.so          [.] __strlen_sse42
> >   0.80%  [kernel]              [k] get_futex_key_refs
> >   0.80%  libpthread-2.12.so    [.] pthread_mutex_trylock
> >   0.78%  libtcmalloc.so.4.2.6  [.] 
> > tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
> >  unsigned long, int)
> >   0.71%  libstdc++.so.6.0.13   [.] std::basic_string<char, 
> > std::char_traits<char>, std::allocator<char> >::basic_string(std::string 
> > const&)
> >   0.68%  ceph-osd              [.] ceph::log::Log::flush()
> >   0.66%  libtcmalloc.so.4.2.6  [.] tc_free
> >   0.63%  [kernel]              [k] resched_curr
> >   0.63%  [kernel]              [k] page_fault
> >   0.62%  libstdc++.so.6.0.13   [.] std::string::reserve(unsigned long)
> > 
> > I'm running the benchmark directly on one of the nodes, which I know is not
> > optimal, but it is still able to deliver those 200K IOPS for the empty
> > volume, so I guess it shouldn't be the problem..
> > 
> > Random write performance is another story, it is really poor, but I'd like
> > to deal with read performance first..
> > 
> > 
> > So my question is: are those numbers normal? If not, what should I check?
> > 
> > I'll be very grateful for all the hints I could get..
> > 
> > thanks a lot in advance
> > 
> > nik
> > 
> > 
> 
> 

--
-------------------------------------
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava

tel.:   +420 591 166 214
fax:    +420 596 621 273
mobil:  +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: ser...@linuxbox.cz
-------------------------------------
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
