Re: 0.56 scrub OSD memleaks, WAS Re: [0.48.3] OSD memory leak when scrubbing

2013-02-19 Thread Samuel Just
Can you confirm that the memory size reported is RES (resident set size)?
-Sam
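
For anyone unsure which number to report: the resident set (what top shows as RES) can be read straight from /proc. A minimal sketch, assuming Linux and that `pgrep` matches the daemon name:

```shell
# Print PID and resident set size (VmRSS, i.e. RES in top) for each
# ceph-osd process on this node, in kilobytes.
for pid in $(pgrep ceph-osd); do
    rss_kb=$(awk '/^VmRSS:/ {print $2}' "/proc/$pid/status")
    printf 'osd pid %s: %s kB resident\n' "$pid" "$rss_kb"
done
```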

On Mon, Feb 18, 2013 at 8:46 AM, Christopher Kunz chrisl...@de-punkt.de wrote:
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


0.56 scrub OSD memleaks, WAS Re: [0.48.3] OSD memory leak when scrubbing

2013-02-18 Thread Christopher Kunz
On 16.02.13 10:09, Wido den Hollander wrote:
 On 02/16/2013 08:09 AM, Andrey Korolyov wrote:
 Can anyone who hit this bug please confirm that your system contains
 libc 2.15+?

 
Hello,

when we started a deep scrub on our 0.56.2 cluster today, we saw a
massive memory leak about one hour into the scrub. One OSD claimed over
53 GB within 10 minutes. We had to restart that OSD to keep the cluster
stable.

Another OSD is currently claiming about 27GByte and will be restarted
soon. All circumstantial evidence points to the deep scrub as the source
of the leak.

One affected node is running libc 2.15 (Ubuntu 12.04 LTS), the other one
is using libc 2.11.3 (Debian Squeeze). So this does not appear to be a
libc-dependent issue.

We have disabled scrub completely.

Regards,

--ck

PS: Do we have any idea when this will be fixed?


Re: [0.48.3] OSD memory leak when scrubbing

2013-02-17 Thread Sébastien Han
+1
--
Regards,
Sébastien Han.


On Sat, Feb 16, 2013 at 10:09 AM, Wido den Hollander w...@42on.com wrote:
 On 02/16/2013 08:09 AM, Andrey Korolyov wrote:

 Can anyone who hit this bug please confirm that your system contains libc
 2.15+?


 I've seen this with 0.56.2 as well on Ubuntu 12.04. Ubuntu 12.04 comes with
 2.15-0ubuntu10.3

 Haven't gotten around to adding a heap profiler to it.

 Wido
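
 For reference, the OSD links against tcmalloc, and its heap profiler can
 be driven through the ceph CLI. Roughly (exact command syntax varies
 between releases, and the paths/IDs below are illustrative, so treat this
 as a sketch):

```shell
# Start the tcmalloc heap profiler on osd.0, let the leak accumulate,
# then dump a profile and print heap stats.
ceph osd tell 0 heap start_profiler
ceph osd tell 0 heap dump
ceph osd tell 0 heap stats

# Analyze the dump with pprof against the ceph-osd binary, e.g.:
# pprof --text /usr/bin/ceph-osd /var/log/ceph/osd.0.profile.0001.heap
```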



Re: [0.48.3] OSD memory leak when scrubbing

2013-02-15 Thread Andrey Korolyov
Can anyone who hit this bug please confirm that your system contains libc 2.15+?
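
A quick way to answer this on each node, assuming glibc (whose `ldd` reports the library version in its banner):

```shell
# Print the glibc version this system's dynamic linker belongs to.
ldd --version | head -n1

# Debian/Ubuntu alternative via the package manager:
# dpkg -s libc6 | grep '^Version'
```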



Re: [0.48.3] OSD memory leak when scrubbing

2013-02-04 Thread Sébastien Han
Hum just tried several times on my test cluster and I can't get any
core dump. Does Ceph commit suicide or something? Is it expected
behavior?
--
Regards,
Sébastien Han.


On Sun, Feb 3, 2013 at 10:03 PM, Sébastien Han han.sebast...@gmail.com wrote:
 Hi Loïc,

 Thanks for bringing our discussion on the ML. I'll check that tomorrow :-).

 Cheers
 --
 Regards,
 Sébastien Han.




 On Sun, Feb 3, 2013 at 7:17 PM, Loic Dachary l...@dachary.org wrote:

 Hi,

 As discussed during FOSDEM, the script you wrote to kill the OSD when it
 grows too much could be amended to core dump instead of just being killed &
 restarted. The binary + core could probably be used to figure out where the
 leak is.

 You should make sure the OSD's current working directory is on a file system
 with enough free disk space to accommodate the dump, and set

 ulimit -c unlimited

 before running it (your system default is probably ulimit -c 0, which
 inhibits core dumps). When you detect that the OSD grows too much, kill it with

 kill -SEGV $pid

 and upload the core found in the working directory, together with the
 binary, to a public place. If the osd binary is compiled with -g but without
 changing the -O settings, you will have a larger binary file but no
 negative impact on performance. Forensic analysis will be made a lot
 easier with the debugging symbols.

 My 2cts

 On 01/31/2013 08:57 PM, Sage Weil wrote:
  On Thu, 31 Jan 2013, Sylvain Munaut wrote:
  Hi,
 
  I disabled scrubbing using
 
  ceph osd tell \* injectargs '--osd-scrub-min-interval 100'
  ceph osd tell \* injectargs '--osd-scrub-max-interval 1000'
 
  and the leak seems to be gone.
 
  See the graph at  http://i.imgur.com/A0KmVot.png  with the OSD memory
  for the 12 osd processes over the last 3.5 days.
  Memory was rising every 24h. I did the change yesterday around 13h00
  and OSDs stopped growing. OSD memory even seems to go down slowly by
  small blocks.
 
  Of course I assume disabling scrubbing is not a long term solution and
  I should re-enable it ... (how do I do that btw ? what were the
  default values for those parameters)
 
  It depends on the exact commit you're on.  You can see the defaults if
  you
  do
 
   ceph-osd --show-config | grep osd_scrub
 
  Thanks for testing this... I have a few other ideas to try to reproduce.
 
  sage
  --
  To unsubscribe from this list: send the line unsubscribe ceph-devel in
  the body of a message to majord...@vger.kernel.org
  More majordomo info at  http://vger.kernel.org/majordomo-info.html

 --
 Loïc Dachary, Artisan Logiciel Libre
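
Loïc's procedure above, collected into one sketch. The pid is a placeholder; run the ulimit part in the shell that will start the daemon, since the limit is inherited at startup:

```shell
# Allow unlimited-size core dumps in this shell; child processes (the
# OSD) inherit the limit. The common default of 0 suppresses cores.
ulimit -c unlimited
ulimit -c    # verify; expect "unlimited" if the hard limit allows it

# Later, when the watchdog sees the OSD ballooning, force a dump:
# kill -SEGV "$osd_pid"
# The core is written to the daemon's working directory (or wherever
# /proc/sys/kernel/core_pattern points).
```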


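
To answer the "what were the defaults" question above: the compiled-in defaults can be read from the binary and then injected back. A sketch; the placeholder values must be replaced with whatever your build actually prints:

```shell
# Show the compiled-in scrub defaults for this build of ceph-osd.
ceph-osd --show-config | grep osd_scrub

# Then push the printed defaults back into the running OSDs, e.g.:
# ceph osd tell \* injectargs '--osd-scrub-min-interval <default>'
# ceph osd tell \* injectargs '--osd-scrub-max-interval <default>'
```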


Re: [0.48.3] OSD memory leak when scrubbing

2013-02-04 Thread Sage Weil
On Mon, 4 Feb 2013, Sébastien Han wrote:
 Hum just tried several times on my test cluster and I can't get any
 core dump. Does Ceph commit suicide or something? Is it expected
 behavior?

SIGSEGV should trigger the usual path that dumps a stack trace and then 
dumps core.  Was your ulimit -c set before the daemon was started?

sage
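
An easy way to check whether a *running* daemon actually has a core limit set, without restarting anything (substitute the OSD's pid; $$ is used here so the line is self-contained):

```shell
# Show the core-size limit the process inherited at startup.
# "Max core file size  0  0  bytes" means no core will be written.
grep 'Max core file size' "/proc/$$/limits"
```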





Re: [0.48.3] OSD memory leak when scrubbing

2013-02-04 Thread Dan Mick
...and/or do you have the corepath set interestingly, or one of the 
core-trapping mechanisms turned on?




Re: [0.48.3] OSD memory leak when scrubbing

2013-02-04 Thread Sébastien Han
OK, I finally managed to get a core dump on my test cluster;
unfortunately, the dump goes to /

any idea to change the destination path?

My production / won't be big enough...

--
Regards,
Sébastien Han.


On Mon, Feb 4, 2013 at 10:03 PM, Dan Mick dan.m...@inktank.com wrote:
 ...and/or do you have the corepath set interestingly, or one of the
 core-trapping mechanisms turned on?




Re: [0.48.3] OSD memory leak when scrubbing

2013-02-04 Thread Gregory Farnum
Set your /proc/sys/kernel/core_pattern file. :) http://linux.die.net/man/5/core
-Greg
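
For the archives: the core_pattern knob takes a full path plus printf-style fields (see core(5)). A sketch; the /var/crash path is just an example:

```shell
# Writing requires root. This names cores by executable, pid and
# timestamp, and sends them to a roomy filesystem instead of /:
#   echo '/var/crash/core.%e.%p.%t' > /proc/sys/kernel/core_pattern

# Reading the current pattern needs no privileges:
cat /proc/sys/kernel/core_pattern
```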

On Mon, Feb 4, 2013 at 1:08 PM, Sébastien Han han.sebast...@gmail.com wrote:
 ok I finally managed to get something on my test cluster,
 unfortunately, the dump goes to /

 any idea to change the destination path?

 My production / won't be big enough...



Re: [0.48.3] OSD memory leak when scrubbing

2013-02-04 Thread Sébastien Han
oh nice, the pattern also matches path :D, didn't know that
thanks Greg
--
Regards,
Sébastien Han.


On Mon, Feb 4, 2013 at 10:22 PM, Gregory Farnum g...@inktank.com wrote:
 Set your /proc/sys/kernel/core_pattern file. :) 
 http://linux.die.net/man/5/core
 -Greg

 On Mon, Feb 4, 2013 at 1:08 PM, Sébastien Han han.sebast...@gmail.com wrote:
 ok I finally managed to get something on my test cluster,
 unfortunately, the dump goes to /

 any idea to change the destination path?

 My production / won't be big enough...

 --
 Regards,
 Sébastien Han.


 On Mon, Feb 4, 2013 at 10:03 PM, Dan Mick dan.m...@inktank.com wrote:
 ...and/or do you have the corepath set interestingly, or one of the
 core-trapping mechanisms turned on?


 On 02/04/2013 11:29 AM, Sage Weil wrote:

 On Mon, 4 Feb 2013, S?bastien Han wrote:

 Hum just tried several times on my test cluster and I can't get any
 core dump. Does Ceph commit suicide or something? Is it expected
 behavior?


 SIGSEGV should trigger the usual path that dumps a stack trace and then
 dumps core.  Was your ulimit -c set before the daemon was started?

 sage



 --
 Regards,
 S?bastien Han.


 On Sun, Feb 3, 2013 at 10:03 PM, S?bastien Han han.sebast...@gmail.com
 wrote:

 Hi Lo?c,

 Thanks for bringing our discussion on the ML. I'll check that tomorrow
 :-).

 Cheer
 --
 Regards,
 S?bastien Han.


 On Sun, Feb 3, 2013 at 10:01 PM, S?bastien Han han.sebast...@gmail.com
 wrote:

 Hi Lo?c,

 Thanks for bringing our discussion on the ML. I'll check that tomorrow
 :-).

 Cheers

 --
 Regards,
 Sébastien Han.


 On Sun, Feb 3, 2013 at 7:17 PM, Loic Dachary l...@dachary.org wrote:


 Hi,

 As discussed during FOSDEM, the script you wrote to kill the OSD when it
 grows too much could be amended to core dump instead of just being
 killed & restarted. The binary + core could probably be used to figure
 out where the leak is.

 You should make sure the OSD current working directory is in a file
 system with enough free disk space to accommodate the dump and set

 ulimit -c unlimited

 before running it (your system default is probably ulimit -c 0, which
 inhibits core dumps). When you detect that the OSD grows too much, kill
 it with

 kill -SEGV $pid

 and upload the core found in the working directory, together with the
 binary, to a public place. If the osd binary is compiled with -g but
 without changing the -O settings, you should have a larger binary file
 but no negative impact on performance. Forensic analysis will be made a
 lot easier with the debugging symbols.

 My 2cts

 On 01/31/2013 08:57 PM, Sage Weil wrote:

 On Thu, 31 Jan 2013, Sylvain Munaut wrote:

 Hi,

 I disabled scrubbing using

 ceph osd tell \* injectargs '--osd-scrub-min-interval 100'
 ceph osd tell \* injectargs '--osd-scrub-max-interval 1000'


 and the leak seems to be gone.

 See the graph at  http://i.imgur.com/A0KmVot.png  with the OSD
 memory
 for the 12 osd processes over the last 3.5 days.
 Memory was rising every 24h. I did the change yesterday around 13h00
 and OSDs stopped growing. OSD memory even seems to go down slowly by
 small blocks.

  Of course I assume disabling scrubbing is not a long-term solution and
  I should re-enable it ... (how do I do that, btw? And what were the
  default values for those parameters?)


 It depends on the exact commit you're on.  You can see the defaults
 if
 you
 do

   ceph-osd --show-config | grep osd_scrub

 Thanks for testing this... I have a few other ideas to try to
 reproduce.

 sage


 --
 Loïc Dachary, Artisan Logiciel Libre






Re: [0.48.3] OSD memory leak when scrubbing

2013-02-03 Thread Loic Dachary
Hi,

As discussed during FOSDEM, the script you wrote to kill the OSD when it grows
too much could be amended to core dump instead of just being killed &
restarted. The binary + core could probably be used to figure out where the
leak is.

You should make sure the OSD current working directory is in a file system with
enough free disk space to accommodate the dump and set

ulimit -c unlimited

before running it (your system default is probably ulimit -c 0, which inhibits
core dumps). When you detect that the OSD grows too much, kill it with

kill -SEGV $pid

and upload the core found in the working directory, together with the binary, to
a public place. If the osd binary is compiled with -g but without changing the
-O settings, you should have a larger binary file but no negative impact on
performance. Forensic analysis will be made a lot easier with the debugging
symbols.

My 2cts
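A minimal sketch of such a watcher (editor's illustration, not the script from the thread; the 300000 kB threshold, the 60 s poll, and the one-OSD-per-host assumption are all illustrative):

```shell
#!/bin/sh
# Editor's sketch: SIGSEGV the OSD once its RSS passes a threshold so it
# dumps core. Threshold, poll interval, and single-OSD assumption are
# illustrative, not from the thread.
ulimit -c unlimited            # the default is often 0, which inhibits cores
LIMIT_KB=300000                # kill once RSS exceeds ~300 MB

# resident set size of a pid, in kB, read from /proc
rss_kb() { awk '/^VmRSS:/ {print $2}' "/proc/$1/status"; }

OSD_PID=$(pidof ceph-osd 2>/dev/null | awk '{print $1}')
if [ -n "$OSD_PID" ]; then
    while kill -0 "$OSD_PID" 2>/dev/null; do
        rss=$(rss_kb "$OSD_PID" 2>/dev/null)
        if [ "${rss:-0}" -gt "$LIMIT_KB" ]; then
            # SIGSEGV makes ceph-osd print its stack trace and dump core
            kill -SEGV "$OSD_PID"
            break
        fi
        sleep 60
    done
fi
```

The core then lands in the OSD's working directory (or wherever core_pattern points), ready to upload alongside the binary.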

On 01/31/2013 08:57 PM, Sage Weil wrote:
 On Thu, 31 Jan 2013, Sylvain Munaut wrote:
 Hi,

 I disabled scrubbing using

 ceph osd tell \* injectargs '--osd-scrub-min-interval 100'
 ceph osd tell \* injectargs '--osd-scrub-max-interval 1000'

 and the leak seems to be gone.

 See the graph at  http://i.imgur.com/A0KmVot.png  with the OSD memory
 for the 12 osd processes over the last 3.5 days.
 Memory was rising every 24h. I did the change yesterday around 13h00
 and OSDs stopped growing. OSD memory even seems to go down slowly by
 small blocks.

 Of course I assume disabling scrubbing is not a long-term solution and
 I should re-enable it ... (how do I do that, btw? And what were the
 default values for those parameters?)
 
 It depends on the exact commit you're on.  You can see the defaults if you 
 do
 
  ceph-osd --show-config | grep osd_scrub
 
 Thanks for testing this... I have a few other ideas to try to reproduce.  
 
 sage

-- 
Loïc Dachary, Artisan Logiciel Libre





Re: [0.48.3] OSD memory leak when scrubbing

2013-02-03 Thread Sébastien Han
Hi Loïc,

Thanks for bringing our discussion on the ML. I'll check that tomorrow :-).

Cheers
--
Regards,
Sébastien Han.


On Sun, Feb 3, 2013 at 10:01 PM, Sébastien Han han.sebast...@gmail.com wrote:
 Hi Loïc,

 Thanks for bringing our discussion on the ML. I'll check that tomorrow :-).

 Cheers

 --
 Regards,
 Sébastien Han.


 On Sun, Feb 3, 2013 at 7:17 PM, Loic Dachary l...@dachary.org wrote:

 Hi,

 As discussed during FOSDEM, the script you wrote to kill the OSD when it
 grows too much could be amended to core dump instead of just being killed &
 restarted. The binary + core could probably be used to figure out where the
 leak is.

 You should make sure the OSD current working directory is in a file system
 with enough free disk space to accommodate the dump and set

 ulimit -c unlimited

 before running it (your system default is probably ulimit -c 0, which
 inhibits core dumps). When you detect that the OSD grows too much, kill it with

 kill -SEGV $pid

 and upload the core found in the working directory, together with the
 binary, to a public place. If the osd binary is compiled with -g but without
 changing the -O settings, you should have a larger binary file but no
 negative impact on performance. Forensic analysis will be made a lot
 easier with the debugging symbols.

 My 2cts

 On 01/31/2013 08:57 PM, Sage Weil wrote:
  On Thu, 31 Jan 2013, Sylvain Munaut wrote:
  Hi,
 
  I disabled scrubbing using
 
  ceph osd tell \* injectargs '--osd-scrub-min-interval 100'
  ceph osd tell \* injectargs '--osd-scrub-max-interval 1000'
 
  and the leak seems to be gone.
 
  See the graph at  http://i.imgur.com/A0KmVot.png  with the OSD memory
  for the 12 osd processes over the last 3.5 days.
  Memory was rising every 24h. I did the change yesterday around 13h00
  and OSDs stopped growing. OSD memory even seems to go down slowly by
  small blocks.
 
  Of course I assume disabling scrubbing is not a long-term solution and
  I should re-enable it ... (how do I do that, btw? And what were the
  default values for those parameters?)
 
  It depends on the exact commit you're on.  You can see the defaults if
  you
  do
 
   ceph-osd --show-config | grep osd_scrub
 
  Thanks for testing this... I have a few other ideas to try to reproduce.
 
  sage

 --
 Loïc Dachary, Artisan Logiciel Libre




Re: [0.48.3] OSD memory leak when scrubbing

2013-01-31 Thread Sylvain Munaut
Hi,

I disabled scrubbing using

 ceph osd tell \* injectargs '--osd-scrub-min-interval 100'
 ceph osd tell \* injectargs '--osd-scrub-max-interval 1000'

and the leak seems to be gone.

See the graph at  http://i.imgur.com/A0KmVot.png  with the OSD memory
for the 12 osd processes over the last 3.5 days.
Memory was rising every 24h. I did the change yesterday around 13h00
and OSDs stopped growing. OSD memory even seems to go down slowly by
small blocks.

Of course I assume disabling scrubbing is not a long-term solution and
I should re-enable it ... (how do I do that, btw? And what were the
default values for those parameters?)

Cheers,

   Sylvain


Re: [0.48.3] OSD memory leak when scrubbing

2013-01-31 Thread Sylvain Munaut
Hi,

 I'm crossing my fingers, but I just noticed that since I upgraded to kernel
 version 3.2.0-36-generic on Ubuntu 12.04 the other day, ceph-osd memory
 usage has stayed stable.

Unfortunately for me, I'm already on 3.2.0-36-generic  (Ubuntu 12.04 as well).

Cheers,

Sylvain


PS: Dave sorry for the double, I forgot reply-to-all ...


Re: [0.48.3] OSD memory leak when scrubbing

2013-01-31 Thread Sage Weil
On Thu, 31 Jan 2013, Sylvain Munaut wrote:
 Hi,
 
 I disabled scrubbing using
 
  ceph osd tell \* injectargs '--osd-scrub-min-interval 100'
  ceph osd tell \* injectargs '--osd-scrub-max-interval 1000'
 
 and the leak seems to be gone.
 
 See the graph at  http://i.imgur.com/A0KmVot.png  with the OSD memory
 for the 12 osd processes over the last 3.5 days.
 Memory was rising every 24h. I did the change yesterday around 13h00
 and OSDs stopped growing. OSD memory even seems to go down slowly by
 small blocks.
 
 Of course I assume disabling scrubbing is not a long-term solution and
 I should re-enable it ... (how do I do that, btw? And what were the
 default values for those parameters?)

It depends on the exact commit you're on.  You can see the defaults if you 
do

 ceph-osd --show-config | grep osd_scrub

Thanks for testing this... I have a few other ideas to try to reproduce.  

sage
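For completeness, checking and restoring the scrub intervals afterwards looks like this (editor's sketch; the interval values shown are the defaults documented for later releases, 1 day and 7 days in seconds, and may differ on a 0.48/0.56 build, so check --show-config first):

```shell
# Editor's sketch: inspect the scrub defaults for this build, then put
# the intervals back. 86400/604800 are the later-release documented
# defaults, an assumption for this era; verify with --show-config.
ceph-osd --show-config | grep osd_scrub
ceph osd tell \* injectargs '--osd-scrub-min-interval 86400'
ceph osd tell \* injectargs '--osd-scrub-max-interval 604800'
```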


Re: [0.48.3] OSD memory leak when scrubbing

2013-01-30 Thread Sylvain Munaut
 Just to keep you posted: we upgraded our cluster yesterday to a
 custom-compiled 0.56.1, it has now been more than 24h, and there is no sign
 of a memory leak anymore. Previously it would rise by ~ 100 M every 24h,
 almost like clockwork, and now, after slightly more than 24h,
 memory is stable. (It fluctuates, but no large jumps that stay
 forever.)

 That's great news.  We've been trying to replicate the argonaut leak here
 on argonaut and haven't succeeded so far.

I'm sorry to report that my excitement was premature ... it didn't
grow during the first 24h, but each day since then has seen a 100 M
increase in OSD memory, so pretty much the same behavior as before.
And again, it happens when scrubbing PGs from the rbd pool.


:(

Cheers,

Sylvain


Re: [0.48.3] OSD memory leak when scrubbing

2013-01-30 Thread Sage Weil
On Wed, 30 Jan 2013, Sylvain Munaut wrote:
  Just to keep you posted: we upgraded our cluster yesterday to a
  custom-compiled 0.56.1, it has now been more than 24h, and there is no sign
  of a memory leak anymore. Previously it would rise by ~ 100 M every 24h,
  almost like clockwork, and now, after slightly more than 24h,
  memory is stable. (It fluctuates, but no large jumps that stay
  forever.)
 
  That's great news.  We've been trying to replicate the argonaut leak here
  on argonaut and haven't succeeded so far.
 
 I'm sorry to report that my excitement was premature ... it didn't
 grow during the first 24h, but each day since then has seen a 100 M
 increase in OSD memory, so pretty much the same behavior as before.
 And again, it happens when scrubbing PGs from the rbd pool.

Can you try disabling scrubbing and see if the leak stops?

ceph osd tell \* injectargs '--osd-scrub-load-threshold .01'

(that will work for 0.56.1, but is fixed in later versions, btw.)  On 
newer code,

ceph osd tell \* injectargs '--osd-scrub-min-interval 100'
ceph osd tell \* injectargs '--osd-scrub-max-interval 1000'

Tracking this via

http://tracker.ceph.com/issues/3883

Thanks!
sage



Re: [0.48.3] OSD memory leak when scrubbing

2013-01-30 Thread Sylvain Munaut
Hi,


 Can you try disabling scrubbing and see if the leak stops?

 ceph osd tell \* injectargs '--osd-scrub-load-threshold .01'

 (that will work for 0.56.1, but is fixed in later versions, btw.)  On
 newer code,

 ceph osd tell \* injectargs '--osd-scrub-min-interval 100'
 ceph osd tell \* injectargs '--osd-scrub-max-interval 1000'

Ok, I just did that.
(I have 0.56.1 + a few more patches from the bobtail branch, up to
c5fe0965572c07...)

I'll report back tomorrow.


 Tracking this via

 http://tracker.ceph.com/issues/3883

Should I post the updates on the ML or on the ticket ?

Cheers,

   Sylvain


Re: [0.48.3] OSD memory leak when scrubbing

2013-01-30 Thread Sage Weil
On Wed, 30 Jan 2013, Sylvain Munaut wrote:
 Hi,
 
 
  Can you try disabling scrubbing and see if the leak stops?
 
  ceph osd tell \* injectargs '--osd-scrub-load-threshold .01'
 
  (that will work for 0.56.1, but is fixed in later versions, btw.)  On
  newer code,
 
  ceph osd tell \* injectargs '--osd-scrub-min-interval 100'
  ceph osd tell \* injectargs '--osd-scrub-max-interval 1000'
 
 Ok, I just did that.
 (I have 0.56.1 + a few more patches from the bobtail branch, up to
 c5fe0965572c07...)
 
 I'll report back tomorrow.
 
 
  Tracking this via
 
  http://tracker.ceph.com/issues/3883
 
 Should I post the updates on the ML or on the ticket ?

Either or both.  We try to keep the ticket up to date, either way.

Thanks!
s


Re: [0.48.3] OSD memory leak when scrubbing

2013-01-27 Thread Sylvain Munaut
Hi,

Just to keep you posted: we upgraded our cluster yesterday to a
custom-compiled 0.56.1, it has now been more than 24h, and there is no sign
of a memory leak anymore. Previously it would rise by ~ 100 M every 24h,
almost like clockwork, and now, after slightly more than 24h,
memory is stable. (It fluctuates, but no large jumps that stay
forever.)

Cheers,

Sylvain


Re: [0.48.3] OSD memory leak when scrubbing

2013-01-27 Thread Sage Weil
On Sun, 27 Jan 2013, Sylvain Munaut wrote:
 Hi,
 
 Just to keep you posted: we upgraded our cluster yesterday to a
 custom-compiled 0.56.1, it has now been more than 24h, and there is no sign
 of a memory leak anymore. Previously it would rise by ~ 100 M every 24h,
 almost like clockwork, and now, after slightly more than 24h,
 memory is stable. (It fluctuates, but no large jumps that stay
 forever.)

That's great news.  We've been trying to replicate the argonaut leak here 
on argonaut and haven't succeeded so far.

sage


Re: [0.48.3] OSD memory leak when scrubbing

2013-01-27 Thread Sylvain Munaut
Hi,

 Just to keep you posted: we upgraded our cluster yesterday to a
 custom-compiled 0.56.1, it has now been more than 24h, and there is no sign
 of a memory leak anymore. Previously it would rise by ~ 100 M every 24h,
 almost like clockwork, and now, after slightly more than 24h,
 memory is stable. (It fluctuates, but no large jumps that stay
 forever.)

 That's great news.  We've been trying to replicate the argonaut leak here
 on argonaut and haven't succeeded so far.

To be entirely complete, I also upgraded the kernel RBD client and
since the leak happened while scrubbing the RBD pool, maybe the client
behavior makes a difference..

Previously they were running kernel 3.6.8, they're now running 3.6.11
with all the ceph related patch from 3.8 backported ( ~ 150 patches ).

Cheers,

Sylvain


Re: [0.48.3] OSD memory leak when scrubbing

2013-01-25 Thread Sébastien Han
Hi,

Could you provide those heaps? Is that possible?

--
Regards,
Sébastien Han.


On Tue, Jan 22, 2013 at 10:38 PM, Sébastien Han han.sebast...@gmail.com wrote:
 Well ideally you want to run the profiler during the scrubbing process
 when the memory leaks appear :-).
 --
 Regards,
 Sébastien Han.


 On Tue, Jan 22, 2013 at 10:32 PM, Sylvain Munaut
 s.mun...@whatever-company.com wrote:
 Hi,

 I don't really want to try the mem profiler; I had quite a bad
 experience with it on a test cluster. While running the profiler, some
 OSDs crashed...
 The only way to track this down is with a heap dump. Could you provide one?

 I just did:

 ceph osd tell 0 heap start_profiler
 ceph osd tell 0 heap dump
 ceph osd tell 0 heap stop_profiler

 and it produced osd.0.profile.0001.heap

 Is it enough or do I actually have to leave it running ?

 I had to stop the profiler because after doing the dump, the OSD
 process was taking 100% of CPU ... stopping the profiler restored it
 to normal.

 Cheers,

 Sylvain


Re: [0.48.3] OSD memory leak when scrubbing

2013-01-25 Thread Sylvain Munaut
 Could you provide those heaps? Is that possible?

We're updating this weekend to 0.56.1.

If it still happens after the update, I'll try to reproduce it on our
test infra and do the profiling there, because unfortunately running the
profiler seems to make it eat up a lot of CPU and RAM ...

I also need to test whether it happens when I force a scrub myself,
because I can't let the profiler run the whole day and just wait for it
to happen naturally, so I need a way to trigger a scrub of all PGs on a
given pool.
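One way to trigger that (editor's sketch; it assumes `ceph pg dump`'s plain output starts each PG line with the pg id as `<poolid>.<seq>` in the first column, and reuses pool #3 from earlier in the thread):

```shell
# Editor's sketch: ask every PG of one pool to scrub. The pool id and
# the "<poolid>.<seq>" first column of `ceph pg dump` are assumptions;
# check your own output before relying on this.
POOL_ID=3

# keep only pg ids whose pool part matches (first whitespace column)
pgs_in_pool() { awk -v p="$1" '$1 ~ ("^" p "\\.") {print $1}'; }

ceph pg dump 2>/dev/null | pgs_in_pool "$POOL_ID" | while read -r pg; do
    ceph pg scrub "$pg"
done
```

Running it with the profiler started should make it possible to catch the leak without waiting a whole day.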


Cheers,

Sylvain


[0.48.3] OSD memory leak when scrubbing

2013-01-22 Thread Sylvain Munaut
Hi,

Since I have ceph in prod, I have experienced a memory leak in the OSDs,
forcing me to restart them every 5 or 6 days. Without that, the OSD
process just grows indefinitely and eventually gets killed by the OOM
killer. (To make sure it wasn't legitimate, I left one grow up to 4G
of RSS ...)

Here's for example the RSS usage of the 12 OSDs process
http://i.imgur.com/ZJxyldq.png during a few hours.

What I've just noticed is that if I look at the logs of the osd
process right when it grows, I can see it's scrubbing PGs from pool
#3. When scrubbing PGs from other pools, nothing really happens
memory-wise.

Pool #3 is the pool where I have all the RBD images for the VMs, and so
it gets a bunch of small read/write/modify operations. The other pools
are used by RGW for object storage and see mostly write-once,
read-many-times access to relatively large objects.

I'm planning to upgrade to 0.56.1 this weekend, and I was hoping to
see if someone knew whether that issue had been fixed in the scrubbing
code.

I've seen other posts about memory leaks, but at the time it wasn't
confirmed what the source was. Here I clearly see it's the scrubbing
of pools that have RBD images.

Cheers,

  Sylvain


Re: [0.48.3] OSD memory leak when scrubbing

2013-01-22 Thread Sébastien Han
Hi,

I originally started a thread around these memory leaks problems here:
http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg11000.html

I'm happy to see that someone supports my theory about the scrubbing
process leaking the memory. I only use RBD from Ceph, so your theory
makes sense as well. Unfortunately, since I run a production platform
I don't really want to try the mem profiler, I had quite a bad
experience with it on a test cluster. While running the profiler some
OSD crashed...
The only way to fix this is to provide a heap dump. Could you provide one?

Moreover I can't reproduce the problem on my test environment... :(

--
Regards,
Sébastien Han.


On Tue, Jan 22, 2013 at 9:01 PM, Sylvain Munaut
s.mun...@whatever-company.com wrote:
 Hi,

 Since I have ceph in prod, I have experienced a memory leak in the OSDs,
 forcing me to restart them every 5 or 6 days. Without that, the OSD
 process just grows indefinitely and eventually gets killed by the OOM
 killer. (To make sure it wasn't legitimate, I left one grow up to 4G
 of RSS ...)

 Here's for example the RSS usage of the 12 OSDs process
 http://i.imgur.com/ZJxyldq.png during a few hours.

 What I've just noticed is that if I look at the logs of the osd
 process right when it grows, I can see it's scrubbing PGs from pool
 #3. When scrubbing PGs from other pools, nothing really happens
 memory-wise.

 Pool #3 is the pool where I have all the RBD images for the VMs, and so
 it gets a bunch of small read/write/modify operations. The other pools
 are used by RGW for object storage and see mostly write-once,
 read-many-times access to relatively large objects.

 I'm planning to upgrade to 0.56.1 this weekend, and I was hoping to
 see if someone knew whether that issue had been fixed in the scrubbing
 code.

 I've seen other posts about memory leaks, but at the time it wasn't
 confirmed what the source was. Here I clearly see it's the scrubbing
 of pools that have RBD images.

 Cheers,

   Sylvain


Re: [0.48.3] OSD memory leak when scrubbing

2013-01-22 Thread Sylvain Munaut
Hi,

 I don't really want to try the mem profiler; I had quite a bad
 experience with it on a test cluster. While running the profiler, some
 OSDs crashed...
 The only way to track this down is with a heap dump. Could you provide one?

I just did:

ceph osd tell 0 heap start_profiler
ceph osd tell 0 heap dump
ceph osd tell 0 heap stop_profiler

and it produced osd.0.profile.0001.heap

Is it enough or do I actually have to leave it running ?

I had to stop the profiler because after doing the dump, the OSD
process was taking 100% of CPU ... stopping the profiler restored it
to normal.

Cheers,

Sylvain
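Once such a dump exists, it can be inspected offline with google-perftools' pprof (editor's sketch; the ceph-osd binary path and the second, later dump file name are assumptions for illustration):

```shell
# Editor's sketch: inspect the heap dump with google-perftools' pprof.
# /usr/bin/ceph-osd and osd.0.profile.0002.heap (a hypothetical second
# dump taken later) are assumed names.
pprof --text /usr/bin/ceph-osd osd.0.profile.0001.heap | head -30

# diff two dumps to see only the allocations that grew between them
pprof --text --base=osd.0.profile.0001.heap \
      /usr/bin/ceph-osd osd.0.profile.0002.heap
```

The diff form is what makes a slow leak stand out, since steady-state allocations cancel.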


Re: [0.48.3] OSD memory leak when scrubbing

2013-01-22 Thread Sébastien Han
Well ideally you want to run the profiler during the scrubbing process
when the memory leaks appear :-).
--
Regards,
Sébastien Han.


On Tue, Jan 22, 2013 at 10:32 PM, Sylvain Munaut
s.mun...@whatever-company.com wrote:
 Hi,

 I don't really want to try the mem profiler; I had quite a bad
 experience with it on a test cluster. While running the profiler, some
 OSDs crashed...
 The only way to track this down is with a heap dump. Could you provide one?

 I just did:

 ceph osd tell 0 heap start_profiler
 ceph osd tell 0 heap dump
 ceph osd tell 0 heap stop_profiler

 and it produced osd.0.profile.0001.heap

 Is it enough or do I actually have to leave it running ?

 I had to stop the profiler because after doing the dump, the OSD
 process was taking 100% of CPU ... stopping the profiler restored it
 to normal.

 Cheers,

 Sylvain