I use zone_reclaim_mode=7 all the time, set at boot (I run QEMU with NUMA memory 
locking on the same nodes, so I need to keep RAM use proportional). Now I am trying 
to use migratepages for all Ceph daemons (with tcmalloc built with 
-DTCMALLOC_SMALL_BUT_SLOW, to avoid the OSDs' memory abuse). Here is my script (it 
migrates each daemon to the node that would keep the most RAM free):

#!/usr/bin/perl
use strict;
use warnings;

# Return the per-node columns of the `numastat $args` row whose first field
# equals $key; the caller pops the trailing "Total" column itself.
sub getnodes{
        my ($args, $key) = @_;
        open(my $FF,'-|',"numastat $args") or die $!;
        while(defined(my $s=<$FF>)){
                chomp($s);
                my @x=split(/\s+/,$s);
                if(shift(@x) eq $key){
                        close($FF);
                        return @x;
                }
        }
        close($FF);
        return;
}

for my $t (''){
#for my $t ('osd'){
        for my $pidfile (glob("/var/run/ceph/$t*.pid")){
                my @free=getnodes('-m','MemFree');
                pop @free;                      # drop the "Total" column
                print "free=@free\n";
                die "no per-node MemFree data" if(!@free);
                chomp(my $p0=`cat $pidfile`);
                chomp(my $p=`pgrep ceph-$t -F $pidfile`);
                next if($p ne $p0);             # stale pidfile, skip
                my @m=getnodes("-p $p",'Total');
                my $total=pop @m;               # process memory summed over all nodes
                print "mem[$p]=@m\n";
                die "node count mismatch" if(!@m or $#m != $#free);
                # Pick the node that would have the most memory left free
                # if the whole process were migrated onto it.
                my ($max,$imax);
                for my $i (0..$#m){
                        my $would_free=$free[$i]-$total+$m[$i];
                        if(!defined($imax) or $max<$would_free){
                                $max=$would_free;
                                $imax=$i;
                        }
                }
                next if($max<=0);               # does not fit on any single node
                # Move the pages from every other node to the chosen one.
                for my $i (0..$#m){
                        next if($i == $imax or $m[$i]==0);
                        print "migratepages $p $i $imax\n";
                        system('migratepages',$p,$i,$imax);
                }
        }
}
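
The script above assumes zone_reclaim_mode=7. For reference, that sysctl is a 
bitmask; a minimal sketch of what the value means (the sysctl lines are shown as 
comments because they need root):

```shell
# vm.zone_reclaim_mode is a bitmask: 1 = reclaim zone memory locally,
# 2 = write out dirty pages during reclaim, 4 = swap local pages.
# 7 enables all three.
mode=$((1 | 2 | 4))
echo "vm.zone_reclaim_mode = $mode"
# Apply at runtime (as root):   sysctl -w vm.zone_reclaim_mode=7
# Persist across boots: add "vm.zone_reclaim_mode = 7" to /etc/sysctl.conf
```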




Jan Schermer writes:
In our case it was the co-scheduling with lots of QEMUs that made it run badly. If 
you have a dedicated Ceph OSD server, it would be beneficial only if the 
scheduler were moving the OSDs between different NUMA zones (which our ancient 
2.6.32 EL6 kernel AFAIK does, but newer kernels do not).

I’m not seeing our OSDs having a problem with IO speed; Dumpling is just so 
slow and CPU-bound it would probably run at the same speed even with spindles. 
It is hard to design a NUMA system where all the resources are local - you’d 
have to have dedicated separate bonds for all zones, etc. - it’s easier to just 
use smaller one-socket machines in greater numbers. When you add QEMU to the mix 
it’s practically impossible to have everything local to all the zones in one box.
(Ceph needs HBAs and NICs, QEMU needs NICs but ideally shouldn’t share cores 
with Ceph… it depends very much on the workload and scale - at greater scale it 
probably doesn’t make sense to have a hyperconverged solution, unless it’s 
easier to just throw more hardware at the problem and only scale horizontally.)


migratepages is a one-shot operation - memory placement after that will depend 
on the kernel you are running, the scheduler, and other settings. Having 
zone_reclaim_mode=1 should prevent memory from “leaking” to the other node, but 
could prevent effective filesystem caching depending on how much free memory 
you have. Surprisingly, I had ~150GB free (about 50% in each of two NUMA zones) 
that wasn’t used for cache until I turned it off; I’m not yet sure why.
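
On a live box, `numastat -m | egrep 'MemFree|FilePages'` shows free memory vs. 
page cache per node. A sketch of parsing that output with awk - the numbers 
below are made up for illustration, two nodes plus the Total column:

```shell
# Illustrative sample of two rows of `numastat -m` output (node0, node1, Total):
sample='MemFree 76000.12 74000.34 150000.46
FilePages 1200.00 1100.00 2300.00'
# Sum of page cache across nodes is the last (Total) field of FilePages:
echo "$sample" | awk '/^FilePages/ { print "cache total (MB):", $NF }'
```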

Scripts will be coming tomorrow, I’d love to see if it makes a change for 
someone else, maybe I’m just un-breaking something in my setup.

Jan

On 24 Jun 2015, at 20:05, Somnath Roy <[email protected]> wrote:

Jan,
This is interesting, as I tried to pin OSDs (though I didn't pin all the 
threads) as part of our tuning and didn't see much difference. I thought this 
could be primarily because of the following.

The NICs and HBAs could always be remote to some OSDs, unless you dedicate 
NICs to the OSDs running on the same NUMA node.

I never tried 'migratepages', though. But I guess 'migratepages' needs to be 
done one time after pinning, right?
I would love to see your scripts and try it out in my environment.

Thanks & Regards
Somnath



-----Original Message-----
From: ceph-users [mailto:[email protected]] On Behalf Of Jan 
Schermer
Sent: Wednesday, June 24, 2015 10:54 AM
To: Ben Hines
Cc: [email protected]
Subject: Re: [ceph-users] Switching from tcmalloc

We did, but I don’t have the numbers. I have lots of graphs, though. We were 
mainly trying to solve the CPU usage, since our nodes are converged QEMU+Ceph 
OSDs, so this made a difference. We were also seeing performance capped on 
CPU when deleting snapshots or backfilling; all of this should be solved by 
this change.

We graph latency, outstanding operations, you name it - I can share a few 
graphs with you tomorrow if I get the permission from my boss :-) Makes for a 
nice comparison with real workload to have one node tcmalloc-free and the 
others running vanilla ceph-osd.

I guess I can share the final script once it’s finished - right now it uses 
taskset and then migratepages to move to the correct NUMA node and is not that 
nice; the cgroup one will be completely different.

You can try migratepages yourself to test whether it makes a difference - pin an 
OSD to a specific node (don’t forget to pin all its threads) and then run 
“migratepages $pid old_node new_node”.
You can confirm the memory moving with “numastat -p $pid”. If it doesn’t seem 
to move, then it is probably pagecache allocated on the wrong node; I’m not sure 
if that can be migrated, but you can use /proc/sys/vm/zone_reclaim_mode (=1), 
which should drop it. I advise setting it back to 0 in the end, though, as cache 
is always faster than disks.
YMMV depending on which bottlenecks your system has.
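
The pin-then-migrate sequence can be sketched as a dry run - each command is 
printed rather than executed (drop the `echo`s and run as root on a real box; 
the pid, CPU list 8-15, and node numbers 0 -> 1 are placeholders):

```shell
pid=12345                          # placeholder OSD pid
echo "taskset -acp 8-15 $pid"      # -a pins ALL existing threads, not just one
echo "migratepages $pid 0 1"       # move pages already allocated on node 0 to node 1
echo "numastat -p $pid"            # verify where the memory lives afterwards
```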

Jan


On 24 Jun 2015, at 19:36, Ben Hines <[email protected]> wrote:

Did you do before/after Ceph performance benchmarks? I don't care if my
systems are using 80% CPU if Ceph performance is better than when
they're using 20% CPU.

Can you share any scripts you have to automate these things? (NUMA
pinning, migratepages)

thanks,

-Ben

On Wed, Jun 24, 2015 at 10:25 AM, Jan Schermer <[email protected]> wrote:
There were essentially three things we had to do for such a drastic
drop:

1) recompile Ceph --without-tcmalloc
2) pin the OSDs to a specific NUMA zone - we had done this for a
long time and it really helped
3) migrate the OSD memory to the correct CPU with migratepages
- we will use cgroups for this in the future; it should make life easier
and is the only correct solution
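
A minimal sketch of what the cgroup approach in step 3 might look like (cgroup 
v1 cpuset controller; the mount point, group name "osd0", CPU list, and node 
number are all assumptions, and this needs root):

```shell
# Confine an OSD's CPUs and memory allocations to one NUMA node via cpuset.
mkdir -p /sys/fs/cgroup/cpuset/osd0
echo 8-15 > /sys/fs/cgroup/cpuset/osd0/cpuset.cpus  # CPUs of node 1 (example)
echo 1    > /sys/fs/cgroup/cpuset/osd0/cpuset.mems  # allocate on node 1 only
echo "$OSD_PID" > /sys/fs/cgroup/cpuset/osd0/tasks  # future allocations stay local
```

Unlike migratepages, this constrains future allocations as well, so the memory 
cannot "leak" back to the other node.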

It is similar to the effect of just restarting the OSD, but much
better - since we immediately see hundreds of connections on a
freshly restarted OSD (and in the benchmark the tcmalloc issue
manifested with just two clients in
parallel), I’d say we never saw the raw (undegraded) performance with
tcmalloc, but it was never this good - consistently low
latencies, much smaller spikes when something happens, and much lower
CPU usage (about 50% savings, but we’re also backfilling a lot in the
background). Workloads are faster as well - reweighting OSDs on
that same node was much (hundreds of percent) faster.

So far the effect has been drastic. I wonder why tcmalloc is even
used when people are having problems with it? The glibc malloc seems
to work just fine for us.

The only concerning thing is the virtual memory usage - we are over
400GB VSS with a few OSDs. That doesn’t hurt anything, though.

Jan


On 24 Jun 2015, at 18:46, Robert LeBlanc <[email protected]> wrote:


Did you see what the effect of just restarting the OSDs before using
tcmalloc? I've noticed that there is usually a good drop for us just
by restarting them. I don't think it is usually this drastic.

- ----------------
Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Wed, Jun 24, 2015 at 2:08 AM, Jan Schermer  wrote:
Can you guess when we did that?
Still on dumpling, btw...

http://www.zviratko.net/link/notcmalloc.png

Jan

_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com













--
WBR, Dzianis Kahanovich AKA Denis Kaganovich, http://mahatma.bspu.unibel.by/