Re: 0.56 scrub OSD memleaks, WAS Re: [0.48.3] OSD memory leak when scrubbing
Can you confirm that the memory size reported is res?
-Sam

On Mon, Feb 18, 2013 at 8:46 AM, Christopher Kunz <chrisl...@de-punkt.de> wrote:
> [quote trimmed; full message below]
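For anyone checking: the resident figure Sam is asking about can be read per process straight from /proc. A minimal sketch, assuming a Linux host where `pidof` finds the ceph-osd daemons:

```shell
# Print PID and resident set size (VmRSS) for each running ceph-osd.
# Sketch only: assumes Linux /proc; VmRSS is what top shows as RES.
for pid in $(pidof ceph-osd); do
    printf '%s ' "$pid"
    awk '/VmRSS/ {print $2, $3}' "/proc/$pid/status"
done
```

VIRT/VSZ can be much larger than RES without indicating a leak, which is why it matters which figure was reported.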
0.56 scrub OSD memleaks, WAS Re: [0.48.3] OSD memory leak when scrubbing
On 16.02.13 10:09, Wido den Hollander wrote:
> On 02/16/2013 08:09 AM, Andrey Korolyov wrote:
>> Can anyone who hit this bug please confirm that your system contains
>> libc 2.15+?

Hello,

when we started a deep scrub on our 0.56.2 cluster today, we saw a massive
memleak about 1 hour into the scrub. One OSD claimed over 53 GByte within
10 minutes. We had to restart the OSD to keep the cluster stable. Another
OSD is currently claiming about 27 GByte and will be restarted soon. All
circumstantial evidence points to the deep scrub as the source of the leak.

One affected node is running libc 2.15 (Ubuntu 12.04 LTS), the other one is
using libc 2.11.3 (Debian Squeeze), so it seems this is not a
libc-dependent issue.

We have disabled scrub completely.

Regards,
--ck

PS: Do we have any idea when this will be fixed?
Re: [0.48.3] OSD memory leak when scrubbing
+1

--
Regards,
Sébastien Han.

On Sat, Feb 16, 2013 at 10:09 AM, Wido den Hollander <w...@42on.com> wrote:
> On 02/16/2013 08:09 AM, Andrey Korolyov wrote:
>> Can anyone who hit this bug please confirm that your system contains
>> libc 2.15+?
>
> I've seen this with 0.56.2 as well on Ubuntu 12.04. Ubuntu 12.04 comes
> with 2.15-0ubuntu10.3. Haven't gotten around to adding a heap profiler
> to it.
>
> Wido
> [earlier quotes trimmed]
Re: [0.48.3] OSD memory leak when scrubbing
Can anyone who hit this bug please confirm that your system contains libc 2.15+?

On Tue, Feb 5, 2013 at 1:27 AM, Sébastien Han <han.sebast...@gmail.com> wrote:
> [quote trimmed]
Re: [0.48.3] OSD memory leak when scrubbing
Hum, just tried several times on my test cluster and I can't get any core
dump. Does Ceph commit suicide or something? Is it expected behavior?

--
Regards,
Sébastien Han.

On Sun, Feb 3, 2013 at 10:03 PM, Sébastien Han <han.sebast...@gmail.com> wrote:
> Hi Loïc,
>
> Thanks for bringing our discussion on the ML. I'll check that tomorrow :-).
>
> Cheers
> [earlier quotes trimmed]
Re: [0.48.3] OSD memory leak when scrubbing
On Mon, 4 Feb 2013, Sébastien Han wrote:
> Hum, just tried several times on my test cluster and I can't get any core
> dump. Does Ceph commit suicide or something? Is it expected behavior?

SIGSEGV should trigger the usual path that dumps a stack trace and then
dumps core. Was your ulimit -c set before the daemon was started?

sage
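A note on Sage's question: the core-size limit a running daemon actually inherited can be checked without restarting it. A sketch, assuming Linux /proc (the ceph-osd name is the only other assumption):

```shell
# A shell's ulimit only affects processes started from it afterwards.
# To see what an already-running daemon inherited, read /proc/<pid>/limits.
pid=$(pidof ceph-osd | awk '{print $1}')
if [ -n "$pid" ]; then
    grep 'Max core file size' "/proc/$pid/limits"
fi

# For future restarts, raise the limit in the shell or init script that
# launches the daemon:
ulimit -c unlimited
```

If the "Max core file size" line shows 0, no amount of fiddling in another shell will make that daemon dump core; it has to be restarted with the limit raised.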
Re: [0.48.3] OSD memory leak when scrubbing
...and/or do you have the corepath set interestingly, or one of the
core-trapping mechanisms turned on?

On 02/04/2013 11:29 AM, Sage Weil wrote:
> [quote trimmed]
Re: [0.48.3] OSD memory leak when scrubbing
OK, I finally managed to get something on my test cluster. Unfortunately,
the dump goes to /. Any idea how to change the destination path? My
production / won't be big enough...

--
Regards,
Sébastien Han.

On Mon, Feb 4, 2013 at 10:03 PM, Dan Mick <dan.m...@inktank.com> wrote:
> [quote trimmed]
Re: [0.48.3] OSD memory leak when scrubbing
Set your /proc/sys/kernel/core_pattern file. :)
http://linux.die.net/man/5/core
-Greg

On Mon, Feb 4, 2013 at 1:08 PM, Sébastien Han <han.sebast...@gmail.com> wrote:
> [quote trimmed]
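A sketch of the core_pattern change Greg suggests (the /var/crash path is an example, not from the thread; needs root):

```shell
# Route kernel core dumps to a filesystem with room for multi-GB cores,
# instead of the dumping process's working directory.
# %e = executable name, %p = PID; see core(5) for the full syntax.
mkdir -p /var/crash
sysctl -w kernel.core_pattern='/var/crash/core.%e.%p'

# To persist across reboots, add to /etc/sysctl.conf:
#   kernel.core_pattern = /var/crash/core.%e.%p
```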
Re: [0.48.3] OSD memory leak when scrubbing
Oh nice, the pattern also matches paths :D, didn't know that. Thanks Greg!

--
Regards,
Sébastien Han.

On Mon, Feb 4, 2013 at 10:22 PM, Gregory Farnum <g...@inktank.com> wrote:
> Set your /proc/sys/kernel/core_pattern file. :)
> http://linux.die.net/man/5/core
> -Greg
> [earlier quotes trimmed]
sage -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Lo?c Dachary, Artisan Logiciel Libre -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
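The kill-on-growth procedure Loïc describes (watch the OSD's RSS, then kill -SEGV it so a core is left behind for analysis) could be sketched roughly as below. This is only an illustrative sketch, not anything from the Ceph tree: the /proc parsing, the threshold, and the function names are assumptions, and it is Linux-only.

```python
# Sketch of the "core dump the OSD when it grows too much" watchdog described
# above. Linux-only: reads RSS from /proc/<pid>/statm. The threshold is an
# illustrative assumption, not a Ceph default.
import os
import signal


def rss_bytes(pid):
    """Resident set size of a process, in bytes.

    The second field of /proc/<pid>/statm is the resident size in pages.
    """
    with open("/proc/%d/statm" % pid) as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


def segv_if_bloated(pid, limit_bytes):
    """If the process exceeds the limit, send SIGSEGV rather than SIGKILL.

    With `ulimit -c unlimited` inherited by the process, SIGSEGV makes it
    leave a core in its working directory (or wherever core_pattern points).
    """
    if rss_bytes(pid) > limit_bytes:
        os.kill(pid, signal.SIGSEGV)
        return True
    return False
```

As discussed above, this only yields a usable core if `ulimit -c unlimited` was set before the daemon started and /proc/sys/kernel/core_pattern points at a file system with room for the dump.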
Re: [0.48.3] OSD memory leak when scrubbing
Hi, As discussed during FOSDEM, the script you wrote to kill the OSD when it grows too much could be amended to core dump instead of the OSD just being killed and restarted. The binary + core could probably be used to figure out where the leak is. You should make sure the OSD's current working directory is on a file system with enough free disk space to accommodate the dump, and set ulimit -c unlimited before running it (your system default is probably ulimit -c 0, which inhibits core dumps). When you detect that the OSD has grown too much, kill it with kill -SEGV $pid and upload the core found in the working directory, together with the binary, to a public place. If the osd binary is compiled with -g but without changing the -O settings, you get a larger binary file but no negative impact on performance. Forensic analysis will be made a lot easier with the debugging symbols. My 2cts

On 01/31/2013 08:57 PM, Sage Weil wrote: On Thu, 31 Jan 2013, Sylvain Munaut wrote: Hi, I disabled scrubbing using ceph osd tell \* injectargs '--osd-scrub-min-interval 100' ceph osd tell \* injectargs '--osd-scrub-max-interval 1000' and the leak seems to be gone. See the graph at http://i.imgur.com/A0KmVot.png with the OSD memory for the 12 osd processes over the last 3.5 days. Memory was rising every 24h. I did the change yesterday around 13h00 and the OSDs stopped growing. OSD memory even seems to go down slowly, in small blocks. Of course, I assume disabling scrubbing is not a long-term solution and I should re-enable it... (how do I do that btw? what were the default values for those parameters?) It depends on the exact commit you're on. You can see the defaults if you do ceph-osd --show-config | grep osd_scrub Thanks for testing this... I have a few other ideas to try to reproduce.

sage

-- Loïc Dachary, Artisan Logiciel Libre
Re: [0.48.3] OSD memory leak when scrubbing
Hi Loïc, Thanks for bringing our discussion on the ML. I'll check that tomorrow :-). Cheers -- Regards, Sébastien Han.

On Sun, Feb 3, 2013 at 10:01 PM, Sébastien Han han.sebast...@gmail.com wrote: Hi Loïc, Thanks for bringing our discussion on the ML. I'll check that tomorrow :-). Cheers -- Regards, Sébastien Han.

On Sun, Feb 3, 2013 at 7:17 PM, Loic Dachary l...@dachary.org wrote: Hi, As discussed during FOSDEM, the script you wrote to kill the OSD when it grows too much could be amended to core dump instead of the OSD just being killed and restarted. The binary + core could probably be used to figure out where the leak is. You should make sure the OSD's current working directory is on a file system with enough free disk space to accommodate the dump, and set ulimit -c unlimited before running it (your system default is probably ulimit -c 0, which inhibits core dumps). When you detect that the OSD has grown too much, kill it with kill -SEGV $pid and upload the core found in the working directory, together with the binary, to a public place. If the osd binary is compiled with -g but without changing the -O settings, you get a larger binary file but no negative impact on performance. Forensic analysis will be made a lot easier with the debugging symbols. My 2cts

On 01/31/2013 08:57 PM, Sage Weil wrote: On Thu, 31 Jan 2013, Sylvain Munaut wrote: Hi, I disabled scrubbing using ceph osd tell \* injectargs '--osd-scrub-min-interval 100' ceph osd tell \* injectargs '--osd-scrub-max-interval 1000' and the leak seems to be gone. See the graph at http://i.imgur.com/A0KmVot.png with the OSD memory for the 12 osd processes over the last 3.5 days. Memory was rising every 24h. I did the change yesterday around 13h00 and the OSDs stopped growing. OSD memory even seems to go down slowly, in small blocks. Of course, I assume disabling scrubbing is not a long-term solution and I should re-enable it... (how do I do that btw? what were the default values for those parameters?) It depends on the exact commit you're on. You can see the defaults if you do ceph-osd --show-config | grep osd_scrub Thanks for testing this... I have a few other ideas to try to reproduce.

sage

-- Loïc Dachary, Artisan Logiciel Libre
Re: [0.48.3] OSD memory leak when scrubbing
Hi, I disabled scrubbing using ceph osd tell \* injectargs '--osd-scrub-min-interval 100' ceph osd tell \* injectargs '--osd-scrub-max-interval 1000' and the leak seems to be gone. See the graph at http://i.imgur.com/A0KmVot.png with the OSD memory for the 12 osd processes over the last 3.5 days. Memory was rising every 24h. I did the change yesterday around 13h00 and the OSDs stopped growing. OSD memory even seems to go down slowly, in small blocks. Of course, I assume disabling scrubbing is not a long-term solution and I should re-enable it... (how do I do that btw? what were the default values for those parameters?) Cheers, Sylvain
Re: [0.48.3] OSD memory leak when scrubbing
Hi,

I'm crossing my fingers, but I just noticed that since I upgraded to kernel version 3.2.0-36-generic on Ubuntu 12.04 the other day, ceph-osd memory usage has stayed stable.

Unfortunately for me, I'm already on 3.2.0-36-generic (Ubuntu 12.04 as well). Cheers, Sylvain PS: Dave, sorry for the double, I forgot reply-to-all ...
Re: [0.48.3] OSD memory leak when scrubbing
On Thu, 31 Jan 2013, Sylvain Munaut wrote: Hi, I disabled scrubbing using ceph osd tell \* injectargs '--osd-scrub-min-interval 100' ceph osd tell \* injectargs '--osd-scrub-max-interval 1000' and the leak seems to be gone. See the graph at http://i.imgur.com/A0KmVot.png with the OSD memory for the 12 osd processes over the last 3.5 days. Memory was rising every 24h. I did the change yesterday around 13h00 and the OSDs stopped growing. OSD memory even seems to go down slowly, in small blocks. Of course, I assume disabling scrubbing is not a long-term solution and I should re-enable it... (how do I do that btw? what were the default values for those parameters?) It depends on the exact commit you're on. You can see the defaults if you do ceph-osd --show-config | grep osd_scrub Thanks for testing this... I have a few other ideas to try to reproduce. sage
Re: [0.48.3] OSD memory leak when scrubbing
Just to keep you posted: upgraded our cluster yesterday to a custom-compiled 0.56.1, and it has now been more than 24h and there is no sign of a memory leak anymore. Previously it would rise by ~100 M every 24h almost like clockwork, and now it's been slightly more than 24h and memory is stable (it fluctuates, but no large jumps that stay forever). That's great news. We've been trying to replicate the argonaut leak here on argonaut and haven't succeeded so far. I'm sorry to report that my excitement was premature... it didn't grow during the first 24h, but each day since then has seen a 100 M increase of OSD memory, so pretty much the same behavior as before. And again, it happens when scrubbing PGs from the rbd pool. :( Cheers, Sylvain
Re: [0.48.3] OSD memory leak when scrubbing
On Wed, 30 Jan 2013, Sylvain Munaut wrote: Just to keep you posted: upgraded our cluster yesterday to a custom-compiled 0.56.1, and it has now been more than 24h and there is no sign of a memory leak anymore. Previously it would rise by ~100 M every 24h almost like clockwork, and now it's been slightly more than 24h and memory is stable (it fluctuates, but no large jumps that stay forever). That's great news. We've been trying to replicate the argonaut leak here on argonaut and haven't succeeded so far. I'm sorry to report that my excitement was premature... it didn't grow during the first 24h, but each day since then has seen a 100 M increase of OSD memory, so pretty much the same behavior as before. And again, it happens when scrubbing PGs from the rbd pool. Can you try disabling scrubbing and see if the leak stops? ceph osd tell \* injectargs '--osd-scrub-load-threshold .01' (that will work for 0.56.1, but is fixed in later versions, btw.) On newer code, ceph osd tell \* injectargs '--osd-scrub-min-interval 100' ceph osd tell \* injectargs '--osd-scrub-max-interval 1000' Tracking this via http://tracker.ceph.com/issues/3883 Thanks! sage
Re: [0.48.3] OSD memory leak when scrubbing
Hi, Can you try disabling scrubbing and see if the leak stops? ceph osd tell \* injectargs '--osd-scrub-load-threshold .01' (that will work for 0.56.1, but is fixed in later versions, btw.) On newer code, ceph osd tell \* injectargs '--osd-scrub-min-interval 100' ceph osd tell \* injectargs '--osd-scrub-max-interval 1000' Ok, I just did that. (I have 0.56.1 + a few more patches from the bobtail branch, up to c5fe0965572c07... ) I'll report back tomorrow. Tracking this via http://tracker.ceph.com/issues/3883 Should I post the updates on the ML or on the ticket? Cheers, Sylvain
Re: [0.48.3] OSD memory leak when scrubbing
On Wed, 30 Jan 2013, Sylvain Munaut wrote: Hi, Can you try disabling scrubbing and see if the leak stops? ceph osd tell \* injectargs '--osd-scrub-load-threshold .01' (that will work for 0.56.1, but is fixed in later versions, btw.) On newer code, ceph osd tell \* injectargs '--osd-scrub-min-interval 100' ceph osd tell \* injectargs '--osd-scrub-max-interval 1000' Ok, I just did that. (I have 0.56.1 + a few more patches from the bobtail branch, up to c5fe0965572c07... ) I'll report back tomorrow. Tracking this via http://tracker.ceph.com/issues/3883 Should I post the updates on the ML or on the ticket? Either or both. We try to keep the ticket up to date, either way. Thanks! s
Re: [0.48.3] OSD memory leak when scrubbing
Hi, Just to keep you posted: upgraded our cluster yesterday to a custom-compiled 0.56.1, and it has now been more than 24h and there is no sign of a memory leak anymore. Previously it would rise by ~100 M every 24h almost like clockwork, and now it's been slightly more than 24h and memory is stable (it fluctuates, but no large jumps that stay forever). Cheers, Sylvain
Re: [0.48.3] OSD memory leak when scrubbing
On Sun, 27 Jan 2013, Sylvain Munaut wrote: Hi, Just to keep you posted: upgraded our cluster yesterday to a custom-compiled 0.56.1, and it has now been more than 24h and there is no sign of a memory leak anymore. Previously it would rise by ~100 M every 24h almost like clockwork, and now it's been slightly more than 24h and memory is stable (it fluctuates, but no large jumps that stay forever). That's great news. We've been trying to replicate the argonaut leak here on argonaut and haven't succeeded so far. sage
Re: [0.48.3] OSD memory leak when scrubbing
Hi, Just to keep you posted: upgraded our cluster yesterday to a custom-compiled 0.56.1, and it has now been more than 24h and there is no sign of a memory leak anymore. Previously it would rise by ~100 M every 24h almost like clockwork, and now it's been slightly more than 24h and memory is stable (it fluctuates, but no large jumps that stay forever). That's great news. We've been trying to replicate the argonaut leak here on argonaut and haven't succeeded so far. To be entirely complete, I also upgraded the kernel RBD client, and since the leak happened while scrubbing the RBD pool, maybe the client behavior makes a difference... Previously they were running kernel 3.6.8; they're now running 3.6.11 with all the ceph-related patches from 3.8 backported (~150 patches). Cheers, Sylvain
Re: [0.48.3] OSD memory leak when scrubbing
Hi, Could you provide those heaps? Is it possible? -- Regards, Sébastien Han. On Tue, Jan 22, 2013 at 10:38 PM, Sébastien Han han.sebast...@gmail.com wrote: Well, ideally you want to run the profiler during the scrubbing process, when the memory leaks appear :-). -- Regards, Sébastien Han. On Tue, Jan 22, 2013 at 10:32 PM, Sylvain Munaut s.mun...@whatever-company.com wrote: Hi, I don't really want to try the mem profiler, I had quite a bad experience with it on a test cluster. While running the profiler some OSDs crashed... The only way to fix this is to provide a heap dump. Could you provide one? I just did: ceph osd tell 0 heap start_profiler ceph osd tell 0 heap dump ceph osd tell 0 heap stop_profiler and it produced osd.0.profile.0001.heap Is it enough, or do I actually have to leave it running? I had to stop the profiler because after doing the dump, the OSD process was taking 100% of CPU... stopping the profiler restored it to normal. Cheers, Sylvain
Re: [0.48.3] OSD memory leak when scrubbing
Could you provide those heaps? Is it possible? We're updating this weekend to 0.56.1. If it still happens after the update, I'll try to reproduce it on our test infra and do the profiling there, because unfortunately running the profiler seems to make it eat up CPU and RAM a lot... I also need to test if it happens when I force a scrub myself, because I can't let the profiler run the whole day and just wait for it to happen naturally, so I need a way to trigger a scrub of all PGs on a given pool. Cheers, Sylvain
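One way to get "a scrub of all PGs on a given pool" is to walk the PG ids, which have the form `<pool-id>.<seed>` (e.g. `3.1a`), and issue `ceph pg scrub` for each one in the target pool. A hypothetical sketch only: the PG-listing step (e.g. from `ceph pg dump`) and error handling are left out, and the helper names are invented here.

```python
# Hypothetical helper to scrub every PG of one pool. PG ids look like
# "<pool-id>.<seed>", so filtering on the "<pool-id>." prefix selects a pool.
import subprocess


def scrub_commands(pg_ids, pool_id):
    """Build one `ceph pg scrub <pgid>` command line per PG in the pool."""
    prefix = "%d." % pool_id
    return [["ceph", "pg", "scrub", pgid]
            for pgid in pg_ids if pgid.startswith(prefix)]


def scrub_pool(pg_ids, pool_id, run=subprocess.check_call):
    """Issue the scrub commands; `run` is injectable for dry runs/testing."""
    for cmd in scrub_commands(pg_ids, pool_id):
        run(cmd)
```

Passing a recording function as `run` gives a dry run, which is handy for checking the PG list before actually kicking off a full-pool scrub on production.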
[0.48.3] OSD memory leak when scrubbing
Hi, Since I've had Ceph in prod, I've experienced a memory leak in the OSDs, forcing me to restart them every 5 or 6 days. Without that, the OSD process just grows infinitely and eventually gets killed by the OOM killer. (To make sure it wasn't legitimate, I left one grow up to 4G of RSS...) Here's, for example, the RSS usage of the 12 OSD processes http://i.imgur.com/ZJxyldq.png during a few hours. What I've just noticed is that if I look at the logs of the osd process right when it grows, I can see it's scrubbing PGs from pool #3. When scrubbing PGs from other pools, nothing really happens memory-wise. Pool #3 is the pool where I have all the RBD images for the VMs, and so it sees a bunch of small read/write/modify operations. The other pools are used by RGW for object storage and are mostly write-once, read-many-times of relatively large objects. I'm planning to upgrade to 0.56.1 this weekend, and I was hoping to see if someone knew whether that issue had been fixed in the scrubbing code? I've seen other posts about memory leaks, but at the time it wasn't confirmed what the source was. Here I clearly see it's the scrubbing on pools that have RBD images. Cheers, Sylvain
Re: [0.48.3] OSD memory leak when scrubbing
Hi, I originally started a thread around these memory leak problems here: http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg11000.html I'm happy to see that someone supports my theory about the scrubbing process leaking the memory. I only use RBD from Ceph, so your theory makes sense as well. Unfortunately, since I run a production platform, I don't really want to try the mem profiler; I had quite a bad experience with it on a test cluster. While running the profiler some OSDs crashed... The only way to fix this is to provide a heap dump. Could you provide one? Moreover, I can't reproduce the problem on my test environment... :( -- Regards, Sébastien Han. On Tue, Jan 22, 2013 at 9:01 PM, Sylvain Munaut s.mun...@whatever-company.com wrote: Hi, Since I've had Ceph in prod, I've experienced a memory leak in the OSDs, forcing me to restart them every 5 or 6 days. Without that, the OSD process just grows infinitely and eventually gets killed by the OOM killer. (To make sure it wasn't legitimate, I left one grow up to 4G of RSS...) Here's, for example, the RSS usage of the 12 OSD processes http://i.imgur.com/ZJxyldq.png during a few hours. What I've just noticed is that if I look at the logs of the osd process right when it grows, I can see it's scrubbing PGs from pool #3. When scrubbing PGs from other pools, nothing really happens memory-wise. Pool #3 is the pool where I have all the RBD images for the VMs, and so it sees a bunch of small read/write/modify operations. The other pools are used by RGW for object storage and are mostly write-once, read-many-times of relatively large objects. I'm planning to upgrade to 0.56.1 this weekend, and I was hoping to see if someone knew whether that issue had been fixed in the scrubbing code? I've seen other posts about memory leaks, but at the time it wasn't confirmed what the source was. Here I clearly see it's the scrubbing on pools that have RBD images.
Cheers, Sylvain
Re: [0.48.3] OSD memory leak when scrubbing
Hi, I don't really want to try the mem profiler, I had quite a bad experience with it on a test cluster. While running the profiler some OSDs crashed... The only way to fix this is to provide a heap dump. Could you provide one? I just did: ceph osd tell 0 heap start_profiler ceph osd tell 0 heap dump ceph osd tell 0 heap stop_profiler and it produced osd.0.profile.0001.heap Is it enough, or do I actually have to leave it running? I had to stop the profiler because after doing the dump, the OSD process was taking 100% of CPU... stopping the profiler restored it to normal. Cheers, Sylvain
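The three `ceph osd tell <id> heap ...` calls Sylvain ran could be wrapped so one profiling session (start, dump, stop) is a single step, which keeps the window with the profiler active as short as possible. A sketch only: it just shells out to the same commands quoted above and assumes `ceph` is on PATH; the function names are invented here.

```python
# Sketch wrapping the heap-profiler commands quoted above into one session.
# Produces osd.<id>.profile.NNNN.heap dumps, as seen in the thread.
import subprocess

HEAP_ACTIONS = ("start_profiler", "dump", "stop_profiler")


def heap_commands(osd_id):
    """Command lines for one start/dump/stop heap-profiling session."""
    return [["ceph", "osd", "tell", str(osd_id), "heap", action]
            for action in HEAP_ACTIONS]


def profile_osd(osd_id, run=subprocess.check_call):
    """Run a full session against one OSD; `run` is injectable for testing."""
    for cmd in heap_commands(osd_id):
        run(cmd)
```

Since the profiler reportedly costs noticeable CPU and RAM while active, stopping it in the same step as the dump (as this does) seems safer than leaving it running all day.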
Re: [0.48.3] OSD memory leak when scrubbing
Well, ideally you want to run the profiler during the scrubbing process, when the memory leaks appear :-). -- Regards, Sébastien Han. On Tue, Jan 22, 2013 at 10:32 PM, Sylvain Munaut s.mun...@whatever-company.com wrote: Hi, I don't really want to try the mem profiler, I had quite a bad experience with it on a test cluster. While running the profiler some OSDs crashed... The only way to fix this is to provide a heap dump. Could you provide one? I just did: ceph osd tell 0 heap start_profiler ceph osd tell 0 heap dump ceph osd tell 0 heap stop_profiler and it produced osd.0.profile.0001.heap Is it enough, or do I actually have to leave it running? I had to stop the profiler because after doing the dump, the OSD process was taking 100% of CPU... stopping the profiler restored it to normal. Cheers, Sylvain