Re: [ceph-users] NUMA zone_reclaim_mode
On Mon, Jan 12, 2015 at 8:25 AM, Dan Van Der Ster daniel.vanders...@cern.ch wrote:

On 12 Jan 2015, at 17:08, Sage Weil s...@newdream.net wrote:

On Mon, 12 Jan 2015, Dan Van Der Ster wrote: Moving forward, I think it would be good for Ceph to at least document this behaviour, but better would be to also detect when zone_reclaim_mode != 0 and warn the admin (like MongoDB does). This line from the commit which disables it in the kernel is pretty wise, IMHO: "On current machines and workloads it is often the case that zone_reclaim_mode destroys performance but not all users know how to detect this. Favour the common case and disable it by default."

Sounds good to me. Do you mind submitting a patch that prints a warning from FileStore::_detect_fs()? That will appear in the local ceph-osd.NNN.log. Alternatively, we could send something to the cluster log (osd->clog.warning() ...), but if we go that route we need to be careful that the logger is up and running first, which (I think) rules out FileStore::_detect_fs(). It could go in OSD itself, although that seems less clean since the recommendation probably doesn't apply when using a backend that doesn't use a file system…

Sure, I’ll try to prepare a patch which warns but isn’t too annoying. MongoDB already solved the heuristic: https://github.com/mongodb/mongo/blob/master/src/mongo/db/startup_warnings_mongod.cpp It’s licensed as AGPLv3 -- do you already know if we can borrow such code into Ceph?

https://www.gnu.org/licenses/license-list.html#AGPL

I've read that and the linked Affero Article 13, and I actually can't tell whether Ceph is safe to integrate or not, but I'm thinking no, since the servers are under LGPL. :/ Also, I'm not sure whether storage system users qualify as remote users, but I don't think we're going to print an Affero string every time somebody runs a ceph tool. ;)

-Greg
Re: [ceph-users] NUMA zone_reclaim_mode
On Mon, 12 Jan 2015, Dan Van Der Ster wrote: Sure, I'll try to prepare a patch which warns but isn't too annoying. MongoDB already solved the heuristic: https://github.com/mongodb/mongo/blob/master/src/mongo/db/startup_warnings_mongod.cpp It's licensed as AGPLv3 -- do you already know if we can borrow such code into Ceph?

I don't think AGPL is compatible, nope! Sorry...

sage
Re: [ceph-users] NUMA zone_reclaim_mode
On 13/01/2015 01:10, Gregory Farnum wrote:

On Mon, Jan 12, 2015 at 8:25 AM, Dan Van Der Ster daniel.vanders...@cern.ch wrote:

On 12 Jan 2015, at 17:08, Sage Weil s...@newdream.net wrote:

On Mon, 12 Jan 2015, Dan Van Der Ster wrote: Moving forward, I think it would be good for Ceph to at least document this behaviour, but better would be to also detect when zone_reclaim_mode != 0 and warn the admin (like MongoDB does). This line from the commit which disables it in the kernel is pretty wise, IMHO: "On current machines and workloads it is often the case that zone_reclaim_mode destroys performance but not all users know how to detect this. Favour the common case and disable it by default."

Sounds good to me. Do you mind submitting a patch that prints a warning from FileStore::_detect_fs()? That will appear in the local ceph-osd.NNN.log. Alternatively, we could send something to the cluster log (osd->clog.warning() ...), but if we go that route we need to be careful that the logger is up and running first, which (I think) rules out FileStore::_detect_fs(). It could go in OSD itself, although that seems less clean since the recommendation probably doesn't apply when using a backend that doesn't use a file system…

Sure, I’ll try to prepare a patch which warns but isn’t too annoying. MongoDB already solved the heuristic: https://github.com/mongodb/mongo/blob/master/src/mongo/db/startup_warnings_mongod.cpp It’s licensed as AGPLv3 -- do you already know if we can borrow such code into Ceph?

https://www.gnu.org/licenses/license-list.html#AGPL

I've read that and the linked Affero Article 13, and I actually can't tell whether Ceph is safe to integrate or not, but I'm thinking no, since the servers are under LGPL. :/ Also, I'm not sure whether storage system users qualify as remote users, but I don't think we're going to print an Affero string every time somebody runs a ceph tool. ;)

-Greg

AGPL does not require that; the approach is more practical: if the server provides you with an API/call to retrieve the sources, such a call can't be removed. It would be a good thing to be able to implement the following scenario:

* I'm connected to a Ceph cluster via RADOS
* I'd like to migrate all I have in this cluster to my own cluster
* Let's ask the Ceph server for the complete and corresponding sources and recompile / repackage them locally
* Deploy my Ceph cluster from the local packages
* Migrate pools from the remote cluster to the local cluster

Regardless of license requirements, in the long run there is almost zero chance of migrating successfully from a service provider (Ceph or otherwise) to a local service without such a mechanism. Note that I'm not only referring to the data stored in the cluster but also to how you're using the service. It's the same problem you would experience when migrating from a MySQL server to a PostgreSQL server, for instance.

Cheers

-- Loïc Dachary, Artisan Logiciel Libre
Re: [ceph-users] NUMA zone_reclaim_mode
Hi Dan,

On 12/01/2015 17:25, Dan Van Der Ster wrote:

On 12 Jan 2015, at 17:08, Sage Weil s...@newdream.net wrote:

On Mon, 12 Jan 2015, Dan Van Der Ster wrote: Moving forward, I think it would be good for Ceph to at least document this behaviour, but better would be to also detect when zone_reclaim_mode != 0 and warn the admin (like MongoDB does). This line from the commit which disables it in the kernel is pretty wise, IMHO: "On current machines and workloads it is often the case that zone_reclaim_mode destroys performance but not all users know how to detect this. Favour the common case and disable it by default."

Sounds good to me. Do you mind submitting a patch that prints a warning from FileStore::_detect_fs()? That will appear in the local ceph-osd.NNN.log. Alternatively, we could send something to the cluster log (osd->clog.warning() ...), but if we go that route we need to be careful that the logger is up and running first, which (I think) rules out FileStore::_detect_fs(). It could go in OSD itself, although that seems less clean since the recommendation probably doesn't apply when using a backend that doesn't use a file system…

Sure, I’ll try to prepare a patch which warns but isn’t too annoying. MongoDB already solved the heuristic: https://github.com/mongodb/mongo/blob/master/src/mongo/db/startup_warnings_mongod.cpp It’s licensed as AGPLv3 -- do you already know if we can borrow such code into Ceph?

If you're looking at adapting the https://github.com/mongodb/mongo/blob/master/src/mongo/db/startup_warnings_mongod.cpp#L107 code block and a few others into Ceph, the licensing terms do not apply. If you were to disregard the MongoDB implementation and rewrite it from scratch, you would come up with the same code, because there is no way to do it differently. A contrario, if the licensing terms were to be applied, every implementation recommending that /proc/sys/vm/zone_reclaim_mode be set to zero would always be under AGPLv3, because they all a) read the file and b) print a warning. That would not make sense, and this is why software implementing such trivial logic can't be copyrighted.

In my opinion you can borrow code from startup_warnings_mongod.cpp into Ceph because it's trivial and non-copyrightable.

Cheers

-- Loïc Dachary, Artisan Logiciel Libre
[ceph-users] NUMA zone_reclaim_mode
(Apologies if you receive this more than once... apparently I cannot reply to a 1-year-old message on the list.)

Dear all,

I'd like to +10 this old proposal of Kyle's. Let me explain why...

A couple of months ago we started testing a new use-case with radosgw -- this new user is writing millions of small files and has been causing us some headaches. Since starting these tests, the relevant OSDs have been randomly freezing for up to ~60s at a time. We have dedicated servers for this use-case, so it doesn't affect our important RBD users, and the OSDs always came back anyway ("wrongly marked me down"...). So I didn't give this problem much attention, though I guessed that we must be suffering from some network connectivity problem.

But last week I started looking into this problem in more detail. With increased debug_osd logs I saw that when these OSDs are getting marked down, even the osd tick message is not printed for 30s. I also correlated these outages with massive drops in cached memory -- it looked as if an admin was running drop_caches on our live machines. Here is what we saw: https://www.dropbox.com/s/418ve09b6m98tyc/Screenshot%202015-01-12%2010.04.16.png?dl=0

Notice the sawtooth cached pages. That server has 20 OSDs, and each OSD has ~1 million files totalling around 40GB (~40kB objects). Compare that with a different OSD host, one that's used for Cinder RBD volumes (and doesn't suffer from the freezing OSD problem): https://www.dropbox.com/s/1lmra5wz7e7qxjy/Screenshot%202015-01-12%2010.11.37.png?dl=0

These RBD servers have identical hardware, but in this case the 20 OSDs each hold around 100k files totalling ~400GB (~4MB objects). Clearly, the 10x increase in the number of files on the radosgw OSDs is causing a problem. In fact, since the servers are pretty idle most of the time, it appears that the _scrubbing_ of these 20 million files per server is causing the problem.

It seems that scrubbing is creating quite some memory pressure (via the inode cache, especially), so I started testing different vfs_cache_pressure values (1, 10, 1000, 1). The only value that sort of helped was vfs_cache_pressure = 1, but keeping all the inodes cached is a pretty extreme measure, and it won't scale up when these OSDs are more full (they're only around 1% full now!!)

Then I discovered the infamous behaviour of zone_reclaim_mode = 1, and this old thread. And I read a bit more, e.g.
http://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases
http://rhaas.blogspot.ch/2014/06/linux-disables-vmzonereclaimmode-by.html

Indeed, all our servers have zone_reclaim_mode = 1. Numerous DB communities regard this option as very bad for servers -- MongoDB even prints a warning message at startup if zone_reclaim_mode is enabled. And finally, in recent kernels (since ~June 2014) zone_reclaim_mode is disabled by default. The vm doc now says: "zone_reclaim_mode is disabled by default. For file servers or workloads that benefit from having their data cached, zone_reclaim_mode should be left disabled as the caching effect is likely to be more important than data locality."

I've set zone_reclaim_mode = 0 on these radosgw OSD servers, and the freezing OSD problem has gone away.
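For reference, the change itself is tiny. Below is a minimal sketch (an illustration only, not Ceph code) of what sysctl -w vm.zone_reclaim_mode=0 amounts to at the procfs level; writing the file needs root, and the setting should be persisted via /etc/sysctl.conf or a sysctl.d snippet so it survives reboots:

    // zone_reclaim_check.cpp -- hypothetical standalone tool, shown only to
    // illustrate the check and the fix; in practice one would just use sysctl.
    #include <fstream>
    #include <iostream>

    int main() {
        const char *path = "/proc/sys/vm/zone_reclaim_mode";
        int mode = -1;

        std::ifstream in(path);
        if (!(in >> mode)) {
            std::cerr << "could not read " << path << "\n";
            return 1;
        }

        if (mode != 0) {
            std::cout << path << " is " << mode << "; setting it to 0\n";
            std::ofstream out(path);   // writing requires root
            out << 0 << std::endl;
            if (!out) {
                std::cerr << "failed to write " << path << "\n";
                return 1;
            }
        }
        return 0;
    }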
Here's a plot of a server that had zone_reclaim_mode set to zero late on Jan 9th: https://www.dropbox.com/s/x5qyn1e1r6fasl5/Screenshot%202015-01-12%2011.47.27.png?dl=0

I also used numactl --interleave=all for the ceph command on one host, but it doesn't appear to make a huge difference beyond disabling NUMA zone reclaim.

Moving forward, I think it would be good for Ceph to at least document this behaviour, but better would be to also detect when zone_reclaim_mode != 0 and warn the admin (like MongoDB does). This line from the commit which disables it in the kernel is pretty wise, IMHO: "On current machines and workloads it is often the case that zone_reclaim_mode destroys performance but not all users know how to detect this. Favour the common case and disable it by default."

Cheers, Dan

On Thu, Dec 12, 2013 at 11:30 PM, Kyle Bader kyle.ba...@gmail.com wrote: It seems that NUMA can be problematic for ceph-osd daemons in certain circumstances. Namely, it seems that if a NUMA zone is running out of memory due to uneven allocation, it is possible for that zone to enter reclaim mode when threads/processes scheduled on a core in that zone request memory allocations greater than the zone's remaining memory. In order for the kernel to satisfy the memory allocation for those processes, it needs to page out some of the contents of the contended zone, which can have dramatic performance implications due to cache misses, etc. I see two ways an operator could alleviate these issues: set the vm.zone_reclaim_mode sysctl setting to 0, along with prefixing ceph-osd daemons with numactl --interleave=all. This should probably be activated by a flag in
Re: [ceph-users] NUMA zone_reclaim_mode
On Mon, 12 Jan 2015, Dan Van Der Ster wrote: Moving forward, I think it would be good for Ceph to at least document this behaviour, but better would be to also detect when zone_reclaim_mode != 0 and warn the admin (like MongoDB does). This line from the commit which disables it in the kernel is pretty wise, IMHO: "On current machines and workloads it is often the case that zone_reclaim_mode destroys performance but not all users know how to detect this. Favour the common case and disable it by default."

Sounds good to me. Do you mind submitting a patch that prints a warning from FileStore::_detect_fs()? That will appear in the local ceph-osd.NNN.log. Alternatively, we could send something to the cluster log (osd->clog.warning() ...), but if we go that route we need to be careful that the logger is up and running first, which (I think) rules out FileStore::_detect_fs(). It could go in OSD itself, although that seems less clean since the recommendation probably doesn't apply when using a backend that doesn't use a file system...

Thanks!
sage
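For illustration, a check along those lines needs only a few lines of C++. Here is a rough sketch -- the helper name is made up, and an actual patch would presumably be called from FileStore::_detect_fs() or the OSD startup path and log through Ceph's own facilities (dout/derr or the cluster log, as discussed above) rather than returning a string:

    // Hypothetical helper, illustrative only. Reads
    // /proc/sys/vm/zone_reclaim_mode; on kernels where the file is absent
    // (e.g. non-NUMA builds) there is nothing to warn about.
    #include <fstream>
    #include <sstream>
    #include <string>

    std::string zone_reclaim_warning()
    {
      std::ifstream f("/proc/sys/vm/zone_reclaim_mode");
      int mode = 0;
      if (!(f >> mode) || mode == 0)
        return "";  // file missing or zone reclaim already disabled
      std::ostringstream msg;
      msg << "WARNING: /proc/sys/vm/zone_reclaim_mode is " << mode
          << "; this is known to cause long stalls under memory pressure"
          << " on NUMA machines. Consider setting it to 0"
          << " (sysctl -w vm.zone_reclaim_mode=0).";
      return msg.str();
    }

The caller would emit the string only when it is non-empty, so hosts that already have zone reclaim disabled stay quiet.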
Re: [ceph-users] NUMA zone_reclaim_mode
On 12 Jan 2015, at 17:08, Sage Weil s...@newdream.net wrote:

On Mon, 12 Jan 2015, Dan Van Der Ster wrote: Moving forward, I think it would be good for Ceph to at least document this behaviour, but better would be to also detect when zone_reclaim_mode != 0 and warn the admin (like MongoDB does). This line from the commit which disables it in the kernel is pretty wise, IMHO: "On current machines and workloads it is often the case that zone_reclaim_mode destroys performance but not all users know how to detect this. Favour the common case and disable it by default."

Sounds good to me. Do you mind submitting a patch that prints a warning from FileStore::_detect_fs()? That will appear in the local ceph-osd.NNN.log. Alternatively, we could send something to the cluster log (osd->clog.warning() ...), but if we go that route we need to be careful that the logger is up and running first, which (I think) rules out FileStore::_detect_fs(). It could go in OSD itself, although that seems less clean since the recommendation probably doesn't apply when using a backend that doesn't use a file system…

Thanks! sage

Sure, I’ll try to prepare a patch which warns but isn’t too annoying. MongoDB already solved the heuristic: https://github.com/mongodb/mongo/blob/master/src/mongo/db/startup_warnings_mongod.cpp It’s licensed as AGPLv3 -- do you already know if we can borrow such code into Ceph?

Cheers, Dan