Re: [ceph-users] NUMA zone_reclaim_mode
On Mon, Jan 12, 2015 at 8:25 AM, Dan Van Der Ster daniel.vanders...@cern.ch wrote:

On 12 Jan 2015, at 17:08, Sage Weil s...@newdream.net wrote:

On Mon, 12 Jan 2015, Dan Van Der Ster wrote: Moving forward, I think it would be good for Ceph to at least document this behaviour, but better would be to also detect when zone_reclaim_mode != 0 and warn the admin (like MongoDB does). This line from the commit which disables it in the kernel is pretty wise, IMHO: "On current machines and workloads it is often the case that zone_reclaim_mode destroys performance but not all users know how to detect this. Favour the common case and disable it by default."

Sounds good to me. Do you mind submitting a patch that prints a warning from FileStore::_detect_fs()? That will appear in the local ceph-osd.NNN.log. Alternatively, we could send something to the cluster log (osd->clog.warning() ...), but if we go that route we need to be careful that the logger is up and running first, which (I think) rules out FileStore::_detect_fs(). It could go in OSD itself, although that seems less clean since the recommendation probably doesn't apply when using a backend that doesn't use a file system…

Sure, I’ll try to prepare a patch which warns but isn’t too annoying. MongoDB already solved the heuristic: https://github.com/mongodb/mongo/blob/master/src/mongo/db/startup_warnings_mongod.cpp It’s licensed as AGPLv3 -- do you already know if we can borrow such code into Ceph?

https://www.gnu.org/licenses/license-list.html#AGPL

I've read that and the linked Affero Article 13, and I actually can't tell whether Ceph is safe to integrate or not, but I'm thinking no, since the servers are under LGPL. :/ Also, I'm not sure whether storage system users qualify as remote users, but I don't think we're going to print an Affero string every time somebody runs a ceph tool. ;)

-Greg
Re: [ceph-users] NUMA zone_reclaim_mode
On Mon, 12 Jan 2015, Dan Van Der Ster wrote: Sure, I'll try to prepare a patch which warns but isn't too annoying. MongoDB already solved the heuristic: https://github.com/mongodb/mongo/blob/master/src/mongo/db/startup_warnings_mongod.cpp It's licensed as AGPLv3 -- do you already know if we can borrow such code into Ceph?

I don't think AGPL is compatible, nope! Sorry...

sage
Re: [ceph-users] NUMA zone_reclaim_mode
On 13/01/2015 01:10, Gregory Farnum wrote:

On Mon, Jan 12, 2015 at 8:25 AM, Dan Van Der Ster daniel.vanders...@cern.ch wrote:

On 12 Jan 2015, at 17:08, Sage Weil s...@newdream.net wrote:

On Mon, 12 Jan 2015, Dan Van Der Ster wrote: Moving forward, I think it would be good for Ceph to at least document this behaviour, but better would be to also detect when zone_reclaim_mode != 0 and warn the admin (like MongoDB does). This line from the commit which disables it in the kernel is pretty wise, IMHO: "On current machines and workloads it is often the case that zone_reclaim_mode destroys performance but not all users know how to detect this. Favour the common case and disable it by default."

Sounds good to me. Do you mind submitting a patch that prints a warning from FileStore::_detect_fs()? That will appear in the local ceph-osd.NNN.log. Alternatively, we could send something to the cluster log (osd->clog.warning() ...), but if we go that route we need to be careful that the logger is up and running first, which (I think) rules out FileStore::_detect_fs(). It could go in OSD itself, although that seems less clean since the recommendation probably doesn't apply when using a backend that doesn't use a file system…

Sure, I’ll try to prepare a patch which warns but isn’t too annoying. MongoDB already solved the heuristic: https://github.com/mongodb/mongo/blob/master/src/mongo/db/startup_warnings_mongod.cpp It’s licensed as AGPLv3 -- do you already know if we can borrow such code into Ceph?

https://www.gnu.org/licenses/license-list.html#AGPL

I've read that and the linked Affero Article 13, and I actually can't tell whether Ceph is safe to integrate or not, but I'm thinking no, since the servers are under LGPL. :/ Also, I'm not sure whether storage system users qualify as remote users, but I don't think we're going to print an Affero string every time somebody runs a ceph tool. ;)

-Greg

AGPL does not require that; the approach is more practical: if the server provides you with an API/call to retrieve the sources, such a call can't be removed. It would be a good thing to be able to implement the following scenario:

* I'm connected to a Ceph cluster via RADOS
* I'd like to migrate all I have in this cluster to my own cluster
* Let's ask the Ceph server for the complete and corresponding sources and recompile / repackage them locally
* Deploy my Ceph cluster from the local packages
* Migrate pools from the remote cluster to the local cluster

Regardless of license requirements, in the long run there is almost zero chance of migrating successfully from a service provider (Ceph or otherwise) to a local service without such a mechanism. Note that I'm not only referring to the data stored in the cluster but also to how you're using the service. It's the same problem you would experience when migrating from a MySQL server to a PostgreSQL server, for instance.

Cheers

-- Loïc Dachary, Artisan Logiciel Libre
Re: [ceph-users] NUMA zone_reclaim_mode
Hi Dan,

On 12/01/2015 17:25, Dan Van Der Ster wrote:

On 12 Jan 2015, at 17:08, Sage Weil s...@newdream.net wrote:

On Mon, 12 Jan 2015, Dan Van Der Ster wrote: Moving forward, I think it would be good for Ceph to at least document this behaviour, but better would be to also detect when zone_reclaim_mode != 0 and warn the admin (like MongoDB does). This line from the commit which disables it in the kernel is pretty wise, IMHO: "On current machines and workloads it is often the case that zone_reclaim_mode destroys performance but not all users know how to detect this. Favour the common case and disable it by default."

Sounds good to me. Do you mind submitting a patch that prints a warning from FileStore::_detect_fs()? That will appear in the local ceph-osd.NNN.log. Alternatively, we could send something to the cluster log (osd->clog.warning() ...), but if we go that route we need to be careful that the logger is up and running first, which (I think) rules out FileStore::_detect_fs(). It could go in OSD itself, although that seems less clean since the recommendation probably doesn't apply when using a backend that doesn't use a file system…

Sure, I’ll try to prepare a patch which warns but isn’t too annoying. MongoDB already solved the heuristic: https://github.com/mongodb/mongo/blob/master/src/mongo/db/startup_warnings_mongod.cpp It’s licensed as AGPLv3 -- do you already know if we can borrow such code into Ceph?

If you're looking at adapting the https://github.com/mongodb/mongo/blob/master/src/mongo/db/startup_warnings_mongod.cpp#L107 code block and a few others into Ceph, the licensing terms do not apply. If you were to disregard the MongoDB implementation and rewrite it from scratch, you would come up with the same code, because there is no way to do it differently. A contrario, if the licensing terms were to be applied, every implementation recommending that /proc/sys/vm/zone_reclaim_mode be set to zero would always be under AGPLv3, because they all a) read the file and b) print a warning. That would not make sense, and this is why software implementing such trivial logic can't be copyrighted.

In my opinion you can borrow code from startup_warnings_mongod.cpp into Ceph because it's trivial and non-copyrightable.

Cheers

-- Loïc Dachary, Artisan Logiciel Libre
[ceph-users] NUMA zone_reclaim_mode
(Apologies if you receive this more than once... apparently I cannot reply to a 1-year-old message on the list.)

Dear all,

I'd like to +10 this old proposal of Kyle's. Let me explain why...

A couple of months ago we started testing a new use-case with radosgw -- this new user is writing millions of small files and has been causing us some headaches. Since starting these tests, the relevant OSDs have been randomly freezing for up to ~60s at a time. We have dedicated servers for this use-case, so it doesn't affect our important RBD users, and the OSDs always came back anyway ("wrongly marked me down"...). So I didn't give this problem much attention, though I guessed that we must be suffering from some network connectivity problem.

But last week I started looking into this problem in more detail. With increased debug_osd logs I saw that when these OSDs are getting marked down, even the osd tick message is not printed for 30s. I also correlated these outages with massive drops in cached memory -- it looked as if an admin was running drop_caches on our live machines. Here is what we saw: https://www.dropbox.com/s/418ve09b6m98tyc/Screenshot%202015-01-12%2010.04.16.png?dl=0

Notice the sawtooth cached pages. That server has 20 OSDs, and each OSD has ~1 million files totalling around 40GB (~40kB objects). Compare that with a different OSD host, one that's used for Cinder RBD volumes (and doesn't suffer from the freezing OSD problem): https://www.dropbox.com/s/1lmra5wz7e7qxjy/Screenshot%202015-01-12%2010.11.37.png?dl=0

These RBD servers have identical hardware, but in this case the 20 OSDs each hold around 100k files totalling ~400GB (~4MB objects). Clearly, the 10x increase in the number of files on the radosgw OSDs is causing a problem. In fact, since the servers are pretty idle most of the time, it appears that the _scrubbing_ of these 20 million files per server is causing the problem.

It seems that scrubbing is creating quite some memory pressure (via the inode cache, especially), so I started testing different vfs_cache_pressure values (1, 10, 1000, 1). The only value that sort of helped was vfs_cache_pressure = 1, but keeping all the inodes cached is a pretty extreme measure, and it won't scale up when these OSDs are more full (they're only around 1% full now!!)

Then I discovered the infamous behaviour of zone_reclaim_mode = 1, and this old thread. And I read a bit more, e.g.
http://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases
http://rhaas.blogspot.ch/2014/06/linux-disables-vmzonereclaimmode-by.html

Indeed, all our servers have zone_reclaim_mode = 1. Numerous DB communities regard this option as very bad for servers -- MongoDB even prints a warning message at startup if zone_reclaim_mode is enabled. And finally, in recent kernels (since ~June 2014) zone_reclaim_mode is disabled by default. The vm doc now says: "zone_reclaim_mode is disabled by default. For file servers or workloads that benefit from having their data cached, zone_reclaim_mode should be left disabled as the caching effect is likely to be more important than data locality."

I've set zone_reclaim_mode = 0 on these radosgw OSD servers, and the freezing OSD problem has gone away.
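For reference, the change itself is tiny. Below is a minimal sketch (an illustration only, not Ceph code) of what sysctl -w vm.zone_reclaim_mode=0 amounts to at the procfs level; writing the file needs root, and the setting should be persisted via /etc/sysctl.conf or a sysctl.d snippet so it survives reboots:

    // zone_reclaim_check.cpp -- hypothetical standalone tool, shown only to
    // illustrate the check and the fix; in practice one would just use sysctl.
    #include <fstream>
    #include <iostream>

    int main() {
        const char *path = "/proc/sys/vm/zone_reclaim_mode";
        int mode = -1;

        std::ifstream in(path);
        if (!(in >> mode)) {
            std::cerr << "could not read " << path << "\n";
            return 1;
        }

        if (mode != 0) {
            std::cout << path << " is " << mode << "; setting it to 0\n";
            std::ofstream out(path);   // writing requires root
            out << 0 << std::endl;
            if (!out) {
                std::cerr << "failed to write " << path << "\n";
                return 1;
            }
        }
        return 0;
    }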
Here's a plot of a server that had zone_reclaim_mode set to zero late on Jan 9th: https://www.dropbox.com/s/x5qyn1e1r6fasl5/Screenshot%202015-01-12%2011.47.27.png?dl=0

I also used numactl --interleave=all for the ceph command on one host, but it doesn't appear to make a huge difference beyond disabling NUMA zone reclaim.

Moving forward, I think it would be good for Ceph to at least document this behaviour, but better would be to also detect when zone_reclaim_mode != 0 and warn the admin (like MongoDB does). This line from the commit which disables it in the kernel is pretty wise, IMHO: "On current machines and workloads it is often the case that zone_reclaim_mode destroys performance but not all users know how to detect this. Favour the common case and disable it by default."

Cheers, Dan

On Thu, Dec 12, 2013 at 11:30 PM, Kyle Bader kyle.ba...@gmail.com wrote: It seems that NUMA can be problematic for ceph-osd daemons in certain circumstances. Namely, it seems that if a NUMA zone is running out of memory due to uneven allocation, it is possible for that zone to enter reclaim mode when threads/processes scheduled on a core in that zone request memory allocations greater than the zone's remaining memory. In order for the kernel to satisfy the memory allocation for those processes, it needs to page out some of the contents of the contended zone, which can have dramatic performance implications due to cache misses, etc. I see two ways an operator could alleviate these issues: set the vm.zone_reclaim_mode sysctl setting to 0, along with prefixing ceph-osd daemons with numactl --interleave=all. This should probably be activated by a flag in
Re: [ceph-users] NUMA zone_reclaim_mode
On Mon, 12 Jan 2015, Dan Van Der Ster wrote: Moving forward, I think it would be good for Ceph to at least document this behaviour, but better would be to also detect when zone_reclaim_mode != 0 and warn the admin (like MongoDB does). This line from the commit which disables it in the kernel is pretty wise, IMHO: "On current machines and workloads it is often the case that zone_reclaim_mode destroys performance but not all users know how to detect this. Favour the common case and disable it by default."

Sounds good to me. Do you mind submitting a patch that prints a warning from FileStore::_detect_fs()? That will appear in the local ceph-osd.NNN.log. Alternatively, we could send something to the cluster log (osd->clog.warning() ...), but if we go that route we need to be careful that the logger is up and running first, which (I think) rules out FileStore::_detect_fs(). It could go in OSD itself, although that seems less clean since the recommendation probably doesn't apply when using a backend that doesn't use a file system...

Thanks!
sage
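For illustration, a check along those lines needs only a few lines of C++. Here is a rough sketch -- the helper name is made up, and an actual patch would presumably be called from FileStore::_detect_fs() or the OSD startup path and log through Ceph's own facilities (dout/derr or the cluster log, as discussed above) rather than returning a string:

    // Hypothetical helper, illustrative only. Reads
    // /proc/sys/vm/zone_reclaim_mode; on kernels where the file is absent
    // (e.g. non-NUMA builds) there is nothing to warn about.
    #include <fstream>
    #include <sstream>
    #include <string>

    std::string zone_reclaim_warning()
    {
      std::ifstream f("/proc/sys/vm/zone_reclaim_mode");
      int mode = 0;
      if (!(f >> mode) || mode == 0)
        return "";  // file missing or zone reclaim already disabled
      std::ostringstream msg;
      msg << "WARNING: /proc/sys/vm/zone_reclaim_mode is " << mode
          << "; this is known to cause long stalls under memory pressure"
          << " on NUMA machines. Consider setting it to 0"
          << " (sysctl -w vm.zone_reclaim_mode=0).";
      return msg.str();
    }

The caller would emit the string only when it is non-empty, so hosts that already have zone reclaim disabled stay quiet.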
Re: [ceph-users] NUMA zone_reclaim_mode
On 12 Jan 2015, at 17:08, Sage Weil s...@newdream.net wrote:

On Mon, 12 Jan 2015, Dan Van Der Ster wrote: Moving forward, I think it would be good for Ceph to at least document this behaviour, but better would be to also detect when zone_reclaim_mode != 0 and warn the admin (like MongoDB does). This line from the commit which disables it in the kernel is pretty wise, IMHO: "On current machines and workloads it is often the case that zone_reclaim_mode destroys performance but not all users know how to detect this. Favour the common case and disable it by default."

Sounds good to me. Do you mind submitting a patch that prints a warning from FileStore::_detect_fs()? That will appear in the local ceph-osd.NNN.log. Alternatively, we could send something to the cluster log (osd->clog.warning() ...), but if we go that route we need to be careful that the logger is up and running first, which (I think) rules out FileStore::_detect_fs(). It could go in OSD itself, although that seems less clean since the recommendation probably doesn't apply when using a backend that doesn't use a file system…

Thanks! sage

Sure, I’ll try to prepare a patch which warns but isn’t too annoying. MongoDB already solved the heuristic: https://github.com/mongodb/mongo/blob/master/src/mongo/db/startup_warnings_mongod.cpp It’s licensed as AGPLv3 -- do you already know if we can borrow such code into Ceph?

Cheers, Dan