Re: [ceph-users] Ceph MDS continually respawning (hammer)

2015-05-28 Thread Gregory Farnum
On Thu, May 28, 2015 at 1:04 AM, Kenneth Waegeman
kenneth.waege...@ugent.be wrote:


 On 05/27/2015 10:30 PM, Gregory Farnum wrote:

 On Wed, May 27, 2015 at 6:49 AM, Kenneth Waegeman
 kenneth.waege...@ugent.be wrote:

 We are also running a full backup sync to cephfs, using multiple
 distributed rsync streams (with zkrsync), and also ran into this issue
 today on Hammer 0.94.1.
 After setting the beacon grace higher, and eventually clearing the journal, it
 stabilized again.

 We were using ceph-fuse to mount the cephfs, not the ceph kernel client.


 What's your MDS cache size set to?

 I did set it to 100 before (we have 64G of RAM for the MDS), trying to
 get rid of the 'Client .. failing to respond to cache pressure' messages.
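
(For reference, the knob being discussed here is the hammer-era mds cache size option, which is an inode count. A minimal sketch of how it is typically raised -- the value below is purely illustrative, not what was actually set in this thread:)

   # in ceph.conf on the MDS host, then restart ceph-mds:
   [mds]
       mds cache size = 1000000    # default is 100000 inodes

   # or injected into a running MDS:
   ceph mds tell 0 injectargs '--mds_cache_size 1000000'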

Oh, that's definitely enough to run into this if one client is eating it
all up, without that patch I referenced. :)
-Greg

  Did you have any warnings in the
  ceph log about clients not releasing caps?

 Unfortunately I lost the logs from before it happened, but there's nothing in the new
 logs about that; I will follow this up.


 I think you could hit this in ceph-fuse as well on hammer, although we
 just merged in a fix: https://github.com/ceph/ceph/pull/4653
 -Greg




Re: [ceph-users] Ceph MDS continually respawning (hammer)

2015-05-28 Thread Kenneth Waegeman



On 05/27/2015 10:30 PM, Gregory Farnum wrote:

On Wed, May 27, 2015 at 6:49 AM, Kenneth Waegeman
kenneth.waege...@ugent.be wrote:

We are also running a full backup sync to cephfs, using multiple distributed
rsync streams (with zkrsync), and also ran into this issue today on Hammer
0.94.1.
After setting the beacon grace higher, and eventually clearing the journal, it
stabilized again.

We were using ceph-fuse to mount the cephfs, not the ceph kernel client.


What's your MDS cache size set to?
I did set it to 100 before (we have 64G of RAM for the MDS), trying
to get rid of the 'Client .. failing to respond to cache pressure' messages.

 Did you have any warnings in the
 ceph log about clients not releasing caps?

Unfortunately I lost the logs from before it happened, but there's nothing in the
new logs about that; I will follow this up.


I think you could hit this in ceph-fuse as well on hammer, although we
just merged in a fix: https://github.com/ceph/ceph/pull/4653
-Greg




Re: [ceph-users] Ceph MDS continually respawning (hammer)

2015-05-27 Thread Kenneth Waegeman
We are also running a full backup sync to cephfs, using multiple
distributed rsync streams (with zkrsync), and also ran into this issue
today on Hammer 0.94.1.
After setting the beacon grace higher, and eventually clearing the journal, it
stabilized again.


We were using ceph-fuse to mount the cephfs, not the ceph kernel client.


On 05/25/2015 11:55 AM, Yan, Zheng wrote:

the kernel client bug should be fixed by
https://github.com/ceph/ceph-client/commit/72f22efb658e6f9e126b2b0fcb065f66ffd02239




Re: [ceph-users] Ceph MDS continually respawning (hammer)

2015-05-27 Thread Gregory Farnum
On Wed, May 27, 2015 at 6:49 AM, Kenneth Waegeman
kenneth.waege...@ugent.be wrote:
 We are also running a full backup sync to cephfs, using multiple distributed
 rsync streams (with zkrsync), and also ran into this issue today on Hammer
 0.94.1.
 After setting the beacon grace higher, and eventually clearing the journal, it
 stabilized again.

 We were using ceph-fuse to mount the cephfs, not the ceph kernel client.

What's your MDS cache size set to? Did you have any warnings in the
ceph log about clients not releasing caps?

I think you could hit this in ceph-fuse as well on hammer, although we
just merged in a fix: https://github.com/ceph/ceph/pull/4653
-Greg


Re: [ceph-users] Ceph MDS continually respawning (hammer)

2015-05-22 Thread Gregory Farnum
On Fri, May 22, 2015 at 12:45 PM, Adam Tygart mo...@ksu.edu wrote:
 Fair enough. Anyway, is it safe to now increase the 'mds beacon grace'
 to try and get the mds server functional again?

Yep! Let us know how it goes...


 I realize there is nothing simple about the things that are being
 accomplished here, and thank everyone for their hard work on making
 this stuff work as well as it does.

:)


Re: [ceph-users] Ceph MDS continually respawning (hammer)

2015-05-22 Thread John Spray



On 22/05/2015 20:06, Gregory Farnum wrote:



Ugh. We appear to be trying to allocate too much memory for this event
in the journal dump; we'll need to fix this. :(
It's not even per-event; it tries to load the entire journal into memory
in one go.  This is a hangover from the old Dumper/Resetter code that I
pulled into cephfs-journal-tool.


http://tracker.ceph.com/issues/11746

Cheers,
John


Re: [ceph-users] Ceph MDS continually respawning (hammer)

2015-05-22 Thread Adam Tygart
Fair enough. Anyway, is it safe to now increase the 'mds beacon grace'
to try and get the mds server functional again?

I realize there is nothing simple about the things that are being
accomplished here, and thank everyone for their hard work on making
this stuff work as well as it does.

--
Adam

On Fri, May 22, 2015 at 2:39 PM, Gregory Farnum g...@gregs42.com wrote:
 On Fri, May 22, 2015 at 12:34 PM, Adam Tygart mo...@ksu.edu wrote:
 I believe I grabbed all of these files:

 for x in $(rados -p metadata ls | grep -E '^200\.'); do rados -p
 metadata get ${x} /tmp/metadata/${x}; done
 tar czSf journal.tar.gz /tmp/metadata

 https://drive.google.com/file/d/0B4XF1RWjuGh5MVFqVFZfNmpfQWc/view?usp=sharing

 When this crash occurred, the rsync process should have been going
 through all of this sequentially, one file at a time. There weren't
 simultaneous rsyncs, and hadn't been for at least a couple days. I
 would have hoped that if the files hadn't been touched in days that
 *something* would have forcibly closed them and flushed them out of
 the caches. I also would have thought that a close file/flush in rsync
 (which I am sure it does, after finishing writing a file) would have
 let them close in the cephfs session.

 This is trickier than it sounds with the kernel dcache, unfortunately.
 We improved it a bit just last week but we'll have to try and diagnose
 what happened in this case more before we can say if it was that issue
 or something else.
 -Greg


 --
 Adam

 On Fri, May 22, 2015 at 2:06 PM, Gregory Farnum g...@gregs42.com wrote:
 On Fri, May 22, 2015 at 11:34 AM, Adam Tygart mo...@ksu.edu wrote:
 On Fri, May 22, 2015 at 11:47 AM, John Spray john.sp...@redhat.com wrote:


 On 22/05/2015 15:33, Adam Tygart wrote:

 Hello all,

 The ceph-mds servers in our cluster are stuck in a constant
 boot-replay-crash cycle.

 I have enabled debug logging for the mds for a restart cycle on one of
 the nodes[1].


 You found a bug, or more correctly you probably found multiple bugs...

 It looks like your journal contains an EOpen event that lists 5307092 open
 files.  Because the MDS only drops its lock between events, not during
 processing a single one, this is causing the heartbeat map to think the 
 MDS
 has locked up, so it's getting killed.

 So firstly we have to fix this to have appropriate calls into
 MDS::heartbeat_reset while iterating over lists of unbounded length in
 EMetablob::replay.  That would fix the false death of the MDS resulting 
 from
 the heartbeat expiry.

 Secondly, this EOpen was a 2.6GB log event.  Something has almost 
 certainly
 gone wrong when we see that data structure grow so large, so we should
 really be imposing some artificial cap there and catching the situation
 earlier, rather than journal this monster event and only hitting issues
 during replay.

 Thirdly, something is apparently leading the MDS to think that 5 million
 files were open in this particular log segment.  It seems like an 
 improbable
 situation given that I can only see a single client in action here.  More
 investigation needed to see how this happened.  Can you describe the 
 client
 workload that was going on in the run up to the system breaking?

 We are in the process of rsyncing 60+ million files from our old
 fileservers into the new cephfs stores. We are down to the last user,
 and it seems he has at least 25% of the total files. We were copying
 files at a rate of about 2 million a day before this crash.


 Anyway, actions:

 1. I'm assuming your metadata is not sensitive, as you have shared this
 debug log.  Please could you use cephfs-journal-tool journal export
 ~/journal.bin to grab an offline copy of the raw journal, in case we need
 to look at it later (this might take a while since your journal seems so
 large, but the resulting file should compress reasonably well with tar
 cSzf).

 Problematic. cephfs-journal-tool is giving me a stacktrace:
 http://people.beocat.cis.ksu.edu/~mozes/ceph/cephfs-journal-tool-stacktrace.txt

 2. optimistically, you may be able to get out of this situation by 
 modifying
 the mds_beacon_grace config option on the MDS (set it to something high).
 This will cause the MDS to continue sending beacons to the mons, even 
 when a
 thread is failing to yield promptly (as in this case), thereby preventing
 the mons from regarding the MDS as failed.  This hopefully will buy the 
 MDS
 enough time to complete replay and come up, assuming it doesn't run out of
 memory in the process of dealing with whatever strangeness is in the
 journal.

 I can hold off on this for a bit if you have something else you'd like
 me to save while it is in its broken state.


 Ugh. We appear to be trying to allocate too much memory for this event
 in the journal dump; we'll need to fix this. :(
 In the meantime, it's a harder workflow but if you can grab all the
 objects in the metadata pool whose name starts with 200. that's the
 raw data that the journal dumper is trying to grab and we can assemble
 it ourselves for checking later on.

Re: [ceph-users] Ceph MDS continually respawning (hammer)

2015-05-22 Thread Adam Tygart
Alright, bumping that up 10 worked. The MDS server came up and
recovered. It took about 1 minute.
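
(Presumably worth putting mds_beacon_grace back to its default afterwards -- 15 seconds in hammer, if I remember the default correctly -- e.g.:)

   ceph mds tell 0 injectargs '--mds_beacon_grace 15'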

Thanks again, guys.
--
Adam

On Fri, May 22, 2015 at 2:50 PM, Gregory Farnum g...@gregs42.com wrote:
 On Fri, May 22, 2015 at 12:45 PM, Adam Tygart mo...@ksu.edu wrote:
 Fair enough. Anyway, is it safe to now increase the 'mds beacon grace'
 to try and get the mds server functional again?

 Yep! Let us know how it goes...


 I realize there is nothing simple about the things that are being
 accomplished here, and thank everyone for their hard work on making
 this stuff work as well as it does.

 :)


Re: [ceph-users] Ceph MDS continually respawning (hammer)

2015-05-22 Thread Lincoln Bryant
Hi Adam,

You can get the MDS to spit out more debug information like so:

# ceph mds tell 0 injectargs '--debug-mds 20 --debug-ms 1'

At least then you can see where it's at when it crashes.
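
If you'd rather capture that in a file than watch the console, something along these lines works (the log path is the stock packaging default, so adjust if yours differs), and remember to turn the verbosity back down afterwards:

   # watch the MDS log while it replays (default log location assumed):
   tail -f /var/log/ceph/ceph-mds.*.log

   # once you have what you need, drop the debug levels again:
   ceph mds tell 0 injectargs '--debug-mds 1 --debug-ms 0'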

--Lincoln

On May 22, 2015, at 9:33 AM, Adam Tygart wrote:

 Hello all,
 
 The ceph-mds servers in our cluster are stuck in a constant
 boot-replay-crash cycle.

 I have enabled debug logging for the mds for a restart cycle on one of
 the nodes[1].
 
 Kernel debug from cephfs client during reconnection attempts:
 [732586.352173] ceph:  mdsc delayed_work
 [732586.352178] ceph:  check_delayed_caps
 [732586.352182] ceph:  lookup_mds_session 88202f01c000 210
 [732586.352185] ceph:  mdsc get_session 88202f01c000 210 - 211
 [732586.352189] ceph:  send_renew_caps ignoring mds0 (up:replay)
 [732586.352192] ceph:  add_cap_releases 88202f01c000 mds0 extra 680
 [732586.352195] ceph:  mdsc put_session 88202f01c000 211 - 210
 [732586.352198] ceph:  mdsc delayed_work
 [732586.352200] ceph:  check_delayed_caps
 [732586.352202] ceph:  lookup_mds_session 881036cbf800 1
 [732586.352205] ceph:  mdsc get_session 881036cbf800 1 - 2
 [732586.352207] ceph:  send_renew_caps ignoring mds0 (up:replay)
 [732586.352210] ceph:  add_cap_releases 881036cbf800 mds0 extra 680
 [732586.352212] ceph:  mdsc put_session 881036cbf800 2 - 1
 [732591.357123] ceph:  mdsc delayed_work
 [732591.357128] ceph:  check_delayed_caps
 [732591.357132] ceph:  lookup_mds_session 88202f01c000 210
 [732591.357135] ceph:  mdsc get_session 88202f01c000 210 - 211
 [732591.357139] ceph:  add_cap_releases 88202f01c000 mds0 extra 680
 [732591.357142] ceph:  mdsc put_session 88202f01c000 211 - 210
 [732591.357145] ceph:  mdsc delayed_work
 [732591.357147] ceph:  check_delayed_caps
 [732591.357149] ceph:  lookup_mds_session 881036cbf800 1
 [732591.357152] ceph:  mdsc get_session 881036cbf800 1 - 2
 [732591.357154] ceph:  add_cap_releases 881036cbf800 mds0 extra 680
 [732591.357157] ceph:  mdsc put_session 881036cbf800 2 - 1
 [732596.362076] ceph:  mdsc delayed_work
 [732596.362081] ceph:  check_delayed_caps
 [732596.362084] ceph:  lookup_mds_session 88202f01c000 210
 [732596.362087] ceph:  mdsc get_session 88202f01c000 210 - 211
 [732596.362091] ceph:  add_cap_releases 88202f01c000 mds0 extra 680
 [732596.362094] ceph:  mdsc put_session 88202f01c000 211 - 210
 [732596.362097] ceph:  mdsc delayed_work
 [732596.362099] ceph:  check_delayed_caps
 [732596.362101] ceph:  lookup_mds_session 881036cbf800 1
 [732596.362104] ceph:  mdsc get_session 881036cbf800 1 - 2
 [732596.362106] ceph:  add_cap_releases 881036cbf800 mds0 extra 680
 [732596.362109] ceph:  mdsc put_session 881036cbf800 2 - 1
 
 Anybody have any debugging tips, or have any ideas on how to get an mds 
 stable?
 
 Server info: CentOS 7.1 with Ceph 0.94.1
 Client info: Gentoo, kernel cephfs. 3.19.5-gentoo
 
 I'd reboot the client, but at this point, I don't believe this is a
 client issue.
 
 [1] 
 https://drive.google.com/file/d/0B4XF1RWjuGh5WU1OZXpNb0Z1ck0/view?usp=sharing
 
 --
 Adam



Re: [ceph-users] Ceph MDS continually respawning (hammer)

2015-05-22 Thread Adam Tygart
Thanks for the quick response.

I had 'debug mds = 20' in the first log, I added 'debug ms = 1' for this one:
https://drive.google.com/file/d/0B4XF1RWjuGh5bXFnRzE1SHF6blE/view?usp=sharing

Based on these logs, it looks like heartbeat_map is_healthy 'MDS' just
times out and then the mds gets respawned.
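
Those heartbeat lines are easy to pull out of the log, e.g. (assuming the default log path):

   grep -i heartbeat_map /var/log/ceph/ceph-mds.*.log | tail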

--
Adam

On Fri, May 22, 2015 at 9:42 AM, Lincoln Bryant linco...@uchicago.edu wrote:
 Hi Adam,

 You can get the MDS to spit out more debug information like so:

 # ceph mds tell 0 injectargs '--debug-mds 20 --debug-ms 1'

 At least then you can see where it's at when it crashes.

 --Lincoln

 On May 22, 2015, at 9:33 AM, Adam Tygart wrote:

 Hello all,

 The ceph-mds servers in our cluster are stuck in a constant
 boot-replay-crash cycle.

 I have enabled debug logging for the mds for a restart cycle on one of
 the nodes[1].

 Kernel debug from cephfs client during reconnection attempts:
 [732586.352173] ceph:  mdsc delayed_work
 [732586.352178] ceph:  check_delayed_caps
 [732586.352182] ceph:  lookup_mds_session 88202f01c000 210
 [732586.352185] ceph:  mdsc get_session 88202f01c000 210 - 211
 [732586.352189] ceph:  send_renew_caps ignoring mds0 (up:replay)
 [732586.352192] ceph:  add_cap_releases 88202f01c000 mds0 extra 680
 [732586.352195] ceph:  mdsc put_session 88202f01c000 211 - 210
 [732586.352198] ceph:  mdsc delayed_work
 [732586.352200] ceph:  check_delayed_caps
 [732586.352202] ceph:  lookup_mds_session 881036cbf800 1
 [732586.352205] ceph:  mdsc get_session 881036cbf800 1 - 2
 [732586.352207] ceph:  send_renew_caps ignoring mds0 (up:replay)
 [732586.352210] ceph:  add_cap_releases 881036cbf800 mds0 extra 680
 [732586.352212] ceph:  mdsc put_session 881036cbf800 2 - 1
 [732591.357123] ceph:  mdsc delayed_work
 [732591.357128] ceph:  check_delayed_caps
 [732591.357132] ceph:  lookup_mds_session 88202f01c000 210
 [732591.357135] ceph:  mdsc get_session 88202f01c000 210 - 211
 [732591.357139] ceph:  add_cap_releases 88202f01c000 mds0 extra 680
 [732591.357142] ceph:  mdsc put_session 88202f01c000 211 - 210
 [732591.357145] ceph:  mdsc delayed_work
 [732591.357147] ceph:  check_delayed_caps
 [732591.357149] ceph:  lookup_mds_session 881036cbf800 1
 [732591.357152] ceph:  mdsc get_session 881036cbf800 1 - 2
 [732591.357154] ceph:  add_cap_releases 881036cbf800 mds0 extra 680
 [732591.357157] ceph:  mdsc put_session 881036cbf800 2 - 1
 [732596.362076] ceph:  mdsc delayed_work
 [732596.362081] ceph:  check_delayed_caps
 [732596.362084] ceph:  lookup_mds_session 88202f01c000 210
 [732596.362087] ceph:  mdsc get_session 88202f01c000 210 - 211
 [732596.362091] ceph:  add_cap_releases 88202f01c000 mds0 extra 680
 [732596.362094] ceph:  mdsc put_session 88202f01c000 211 - 210
 [732596.362097] ceph:  mdsc delayed_work
 [732596.362099] ceph:  check_delayed_caps
 [732596.362101] ceph:  lookup_mds_session 881036cbf800 1
 [732596.362104] ceph:  mdsc get_session 881036cbf800 1 - 2
 [732596.362106] ceph:  add_cap_releases 881036cbf800 mds0 extra 680
 [732596.362109] ceph:  mdsc put_session 881036cbf800 2 - 1

 Anybody have any debugging tips, or have any ideas on how to get an mds 
 stable?

 Server info: CentOS 7.1 with Ceph 0.94.1
 Client info: Gentoo, kernel cephfs. 3.19.5-gentoo

 I'd reboot the client, but at this point, I don't believe this is a
 client issue.

 [1] 
 https://drive.google.com/file/d/0B4XF1RWjuGh5WU1OZXpNb0Z1ck0/view?usp=sharing

 --
 Adam



Re: [ceph-users] Ceph MDS continually respawning (hammer)

2015-05-22 Thread Lincoln Bryant
I've experienced MDS issues in the past, but nothing sticks out to me in your 
logs.

Are you using a single active MDS with failover, or multiple active MDS? 

--Lincoln

On May 22, 2015, at 10:10 AM, Adam Tygart wrote:

 Thanks for the quick response.
 
 I had 'debug mds = 20' in the first log, I added 'debug ms = 1' for this one:
 https://drive.google.com/file/d/0B4XF1RWjuGh5bXFnRzE1SHF6blE/view?usp=sharing
 
 Based on these logs, it looks like heartbeat_map is_healthy 'MDS' just
 times out and then the mds gets respawned.
 
 --
 Adam
 
 On Fri, May 22, 2015 at 9:42 AM, Lincoln Bryant linco...@uchicago.edu wrote:
 Hi Adam,
 
 You can get the MDS to spit out more debug information like so:
 
# ceph mds tell 0 injectargs '--debug-mds 20 --debug-ms 1'
 
 At least then you can see where it's at when it crashes.
 
 --Lincoln
 
 On May 22, 2015, at 9:33 AM, Adam Tygart wrote:
 
 Hello all,
 
 The ceph-mds servers in our cluster are stuck in a constant
 boot-replay-crash cycle.

 I have enabled debug logging for the mds for a restart cycle on one of
 the nodes[1].
 
 Kernel debug from cephfs client during reconnection attempts:
 [732586.352173] ceph:  mdsc delayed_work
 [732586.352178] ceph:  check_delayed_caps
 [732586.352182] ceph:  lookup_mds_session 88202f01c000 210
 [732586.352185] ceph:  mdsc get_session 88202f01c000 210 - 211
 [732586.352189] ceph:  send_renew_caps ignoring mds0 (up:replay)
 [732586.352192] ceph:  add_cap_releases 88202f01c000 mds0 extra 680
 [732586.352195] ceph:  mdsc put_session 88202f01c000 211 - 210
 [732586.352198] ceph:  mdsc delayed_work
 [732586.352200] ceph:  check_delayed_caps
 [732586.352202] ceph:  lookup_mds_session 881036cbf800 1
 [732586.352205] ceph:  mdsc get_session 881036cbf800 1 - 2
 [732586.352207] ceph:  send_renew_caps ignoring mds0 (up:replay)
 [732586.352210] ceph:  add_cap_releases 881036cbf800 mds0 extra 680
 [732586.352212] ceph:  mdsc put_session 881036cbf800 2 - 1
 [732591.357123] ceph:  mdsc delayed_work
 [732591.357128] ceph:  check_delayed_caps
 [732591.357132] ceph:  lookup_mds_session 88202f01c000 210
 [732591.357135] ceph:  mdsc get_session 88202f01c000 210 - 211
 [732591.357139] ceph:  add_cap_releases 88202f01c000 mds0 extra 680
 [732591.357142] ceph:  mdsc put_session 88202f01c000 211 - 210
 [732591.357145] ceph:  mdsc delayed_work
 [732591.357147] ceph:  check_delayed_caps
 [732591.357149] ceph:  lookup_mds_session 881036cbf800 1
 [732591.357152] ceph:  mdsc get_session 881036cbf800 1 - 2
 [732591.357154] ceph:  add_cap_releases 881036cbf800 mds0 extra 680
 [732591.357157] ceph:  mdsc put_session 881036cbf800 2 - 1
 [732596.362076] ceph:  mdsc delayed_work
 [732596.362081] ceph:  check_delayed_caps
 [732596.362084] ceph:  lookup_mds_session 88202f01c000 210
 [732596.362087] ceph:  mdsc get_session 88202f01c000 210 - 211
 [732596.362091] ceph:  add_cap_releases 88202f01c000 mds0 extra 680
 [732596.362094] ceph:  mdsc put_session 88202f01c000 211 - 210
 [732596.362097] ceph:  mdsc delayed_work
 [732596.362099] ceph:  check_delayed_caps
 [732596.362101] ceph:  lookup_mds_session 881036cbf800 1
 [732596.362104] ceph:  mdsc get_session 881036cbf800 1 - 2
 [732596.362106] ceph:  add_cap_releases 881036cbf800 mds0 extra 680
 [732596.362109] ceph:  mdsc put_session 881036cbf800 2 - 1
 
 Anybody have any debugging tips, or have any ideas on how to get an mds 
 stable?
 
 Server info: CentOS 7.1 with Ceph 0.94.1
 Client info: Gentoo, kernel cephfs. 3.19.5-gentoo
 
 I'd reboot the client, but at this point, I don't believe this is a
 client issue.
 
 [1] 
 https://drive.google.com/file/d/0B4XF1RWjuGh5WU1OZXpNb0Z1ck0/view?usp=sharing
 
 --
 Adam
 



Re: [ceph-users] Ceph MDS continually respawning (hammer)

2015-05-22 Thread Adam Tygart
I knew I forgot to include something with my initial e-mail.

Single active with failover.

dumped mdsmap epoch 30608
epoch   30608
flags   0
created 2015-04-02 16:15:55.209894
modified 2015-05-22 11:39:15.992774
tableserver 0
root 0
session_timeout 60
session_autoclose   300
max_file_size   17592186044416
last_failure30606
last_failure_osd_epoch  24298
compat  compat={},rocompat={},incompat={1=base v0.20,2=client
writeable ranges,3=default file layouts on dirs,4=dir inode in
separate object,5=mds uses versioned encoding,6=dirfrag is stored in
omap,8=no anchor table}
max_mds 1
in  0
up  {0=20284976}
failed
stopped
data_pools  25
metadata_pool   27
inline_data disabled
20285024:   10.5.38.2:7021/32024 'hobbit02' mds.-1.0 up:standby seq 1
20346784:   10.5.38.1:6957/223554 'hobbit01' mds.-1.0 up:standby seq 1
20284976:   10.5.38.13:6926/66700 'hobbit13' mds.0.1696 up:replay seq 1
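
(For context, the map above is the sort of output produced by the hammer CLI's mds dump command -- noted here as an assumption about how it was generated:)

   ceph mds dump        # full MDS map, as pasted above
   ceph mds stat        # one-line summary of MDS states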

--
Adam

On Fri, May 22, 2015 at 11:37 AM, Lincoln Bryant linco...@uchicago.edu wrote:
 I've experienced MDS issues in the past, but nothing sticks out to me in your 
 logs.

 Are you using a single active MDS with failover, or multiple active MDS?

 --Lincoln

 On May 22, 2015, at 10:10 AM, Adam Tygart wrote:

 Thanks for the quick response.

 I had 'debug mds = 20' in the first log, I added 'debug ms = 1' for this one:
 https://drive.google.com/file/d/0B4XF1RWjuGh5bXFnRzE1SHF6blE/view?usp=sharing

 Based on these logs, it looks like heartbeat_map is_healthy 'MDS' just
 times out and then the mds gets respawned.

 --
 Adam

 On Fri, May 22, 2015 at 9:42 AM, Lincoln Bryant linco...@uchicago.edu 
 wrote:
 Hi Adam,

 You can get the MDS to spit out more debug information like so:

# ceph mds tell 0 injectargs '--debug-mds 20 --debug-ms 1'

 At least then you can see where it's at when it crashes.

 --Lincoln

 On May 22, 2015, at 9:33 AM, Adam Tygart wrote:

 Hello all,

 The ceph-mds servers in our cluster are stuck in a constant
 boot-replay-crash cycle.

 I have enabled debug logging for the mds for a restart cycle on one of
 the nodes[1].

 Kernel debug from cephfs client during reconnection attempts:
 [732586.352173] ceph:  mdsc delayed_work
 [732586.352178] ceph:  check_delayed_caps
 [732586.352182] ceph:  lookup_mds_session 88202f01c000 210
 [732586.352185] ceph:  mdsc get_session 88202f01c000 210 - 211
 [732586.352189] ceph:  send_renew_caps ignoring mds0 (up:replay)
 [732586.352192] ceph:  add_cap_releases 88202f01c000 mds0 extra 680
 [732586.352195] ceph:  mdsc put_session 88202f01c000 211 - 210
 [732586.352198] ceph:  mdsc delayed_work
 [732586.352200] ceph:  check_delayed_caps
 [732586.352202] ceph:  lookup_mds_session 881036cbf800 1
 [732586.352205] ceph:  mdsc get_session 881036cbf800 1 - 2
 [732586.352207] ceph:  send_renew_caps ignoring mds0 (up:replay)
 [732586.352210] ceph:  add_cap_releases 881036cbf800 mds0 extra 680
 [732586.352212] ceph:  mdsc put_session 881036cbf800 2 - 1
 [732591.357123] ceph:  mdsc delayed_work
 [732591.357128] ceph:  check_delayed_caps
 [732591.357132] ceph:  lookup_mds_session 88202f01c000 210
 [732591.357135] ceph:  mdsc get_session 88202f01c000 210 - 211
 [732591.357139] ceph:  add_cap_releases 88202f01c000 mds0 extra 680
 [732591.357142] ceph:  mdsc put_session 88202f01c000 211 - 210
 [732591.357145] ceph:  mdsc delayed_work
 [732591.357147] ceph:  check_delayed_caps
 [732591.357149] ceph:  lookup_mds_session 881036cbf800 1
 [732591.357152] ceph:  mdsc get_session 881036cbf800 1 - 2
 [732591.357154] ceph:  add_cap_releases 881036cbf800 mds0 extra 680
 [732591.357157] ceph:  mdsc put_session 881036cbf800 2 - 1
 [732596.362076] ceph:  mdsc delayed_work
 [732596.362081] ceph:  check_delayed_caps
 [732596.362084] ceph:  lookup_mds_session 88202f01c000 210
 [732596.362087] ceph:  mdsc get_session 88202f01c000 210 - 211
 [732596.362091] ceph:  add_cap_releases 88202f01c000 mds0 extra 680
 [732596.362094] ceph:  mdsc put_session 88202f01c000 211 - 210
 [732596.362097] ceph:  mdsc delayed_work
 [732596.362099] ceph:  check_delayed_caps
 [732596.362101] ceph:  lookup_mds_session 881036cbf800 1
 [732596.362104] ceph:  mdsc get_session 881036cbf800 1 - 2
 [732596.362106] ceph:  add_cap_releases 881036cbf800 mds0 extra 680
 [732596.362109] ceph:  mdsc put_session 881036cbf800 2 - 1

 Anybody have any debugging tips, or have any ideas on how to get an mds 
 stable?

 Server info: CentOS 7.1 with Ceph 0.94.1
 Client info: Gentoo, kernel cephfs. 3.19.5-gentoo

 I'd reboot the client, but at this point, I don't believe this is a
 client issue.

 [1] 
 https://drive.google.com/file/d/0B4XF1RWjuGh5WU1OZXpNb0Z1ck0/view?usp=sharing

 --
 Adam



Re: [ceph-users] Ceph MDS continually respawning (hammer)

2015-05-22 Thread John Spray



On 22/05/2015 15:33, Adam Tygart wrote:

Hello all,

The ceph-mds servers in our cluster are stuck in a constant
boot-replay-crash cycle.

I have enabled debug logging for the mds for a restart cycle on one of
the nodes[1].


You found a bug, or more correctly you probably found multiple bugs...

It looks like your journal contains an EOpen event that lists 5307092 
open files.  Because the MDS only drops its lock between events, not 
during processing a single one, this is causing the heartbeat map to 
think the MDS has locked up, so it's getting killed.


So firstly we have to fix this to have appropriate calls into 
MDS::heartbeat_reset while iterating over lists of unbounded length in 
EMetablob::replay.  That would fix the false death of the MDS resulting 
from the heartbeat expiry.


Secondly, this EOpen was a 2.6GB log event.  Something has almost 
certainly gone wrong when we see that data structure grow so large, so 
we should really be imposing some artificial cap there and catching the 
situation earlier, rather than journal this monster event and only 
hitting issues during replay.


Thirdly, something is apparently leading the MDS to think that 5 million 
files were open in this particular log segment.  It seems like an 
improbable situation given that I can only see a single client in action 
here.  More investigation needed to see how this happened.  Can you 
describe the client workload that was going on in the run up to the 
system breaking?


Anyway, actions:

1. I'm assuming your metadata is not sensitive, as you have shared this 
debug log.  Please could you use cephfs-journal-tool journal export 
~/journal.bin to grab an offline copy of the raw journal, in case we 
need to look at it later (this might take a while since your journal 
seems so large, but the resulting file should compress reasonably well 
with tar cSzf).
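
In concrete terms that would be something like the following -- the filenames are just examples:

   cephfs-journal-tool journal export ~/journal.bin
   tar cSzf ~/journal.tar.gz -C ~ journal.bin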


2. optimistically, you may be able to get out of this situation by 
modifying the mds_beacon_grace config option on the MDS (set it to 
something high).  This will cause the MDS to continue sending beacons to 
the mons, even when a thread is failing to yield promptly (as in this 
case), thereby preventing the mons from regarding the MDS as failed.  
This hopefully will buy the MDS enough time to complete replay and come 
up, assuming it doesn't run out of memory in the process of dealing with 
whatever strangeness is in the journal.
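
A minimal sketch of what "set it to something high" could look like -- the value 600 and the choice of config section are illustrative assumptions, not numbers from this thread:

   # in ceph.conf on the MDS host (and arguably the mons too), then restart ceph-mds:
   [mds]
       mds beacon grace = 600

   # or, if the daemon is reachable, injected at runtime:
   ceph mds tell 0 injectargs '--mds_beacon_grace 600'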


3. If your MDS eventually makes it through recovery, unmount your client 
and use ceph daemon mds.<id> flush journal to flush and trim the 
journal: this should result in a situation where the next time the MDS 
starts, the oversized journal entries are no longer present and startup 
should go smoothly.
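
In command form, that last step is roughly the following (the mount point and MDS id are placeholders):

   umount /mnt/cephfs                      # unmount the client first, as above
   ceph daemon mds.<id> flush journal      # run on the host where the active ceph-mds lives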


Cheers,
John


Re: [ceph-users] Ceph MDS continually respawning (hammer)

2015-05-22 Thread Lincoln Bryant
I notice in both logs, the last entry before the MDS restart/failover is when 
the mds is replaying the journal and gets to 
/homes/gundimed/IPD/10kb/1e-500d/DisplayLog/

2015-05-22 09:59:19.116231 7f9d930c1700 10 mds.0.journal EMetaBlob.replay for 
[2,head] had [inode 13f8e31 [...2,head] 
/homes/gundimed/IPD/10kb/1e-500d/DisplayLog/ auth v20776 f(v0 m2015-05-22 
02:34:09.00 357=357+0) n(v1 rc2015-05-22 02:34:09.00 b71340955004 
358=357+1) (iversion lock) | dirfrag=1 dirty=1 0x6ded9c8]

2015-05-22 08:04:31.993007 7f87afb2f700 10 mds.0.journal EMetaBlob.replay for 
[2,head] had [inode 13f8e31 [...2,head] 
/homes/gundimed/IPD/10kb/1e-500d/DisplayLog/ auth v20776 f(v0 m2015-05-22 
02:34:09.00 357=357+0) n(v1 rc2015-05-22 02:34:09.00 b71340955004 
358=357+1) (iversion lock) | dirfrag=1 dirty=1 0x76a59c8]

Maybe there's some problem in this part of the journal? Or maybe that's the end 
of the journal and it crashes afterwards? No idea :( Hopefully one of the devs 
can weigh in.

--Lincoln

On May 22, 2015, at 11:40 AM, Adam Tygart wrote:

 I knew I forgot to include something with my initial e-mail.
 
 Single active with failover.
 
 dumped mdsmap epoch 30608
 epoch   30608
 flags   0
 created 2015-04-02 16:15:55.209894
 modified 2015-05-22 11:39:15.992774
 tableserver 0
 root 0
 session_timeout 60
 session_autoclose   300
 max_file_size   17592186044416
 last_failure30606
 last_failure_osd_epoch  24298
 compat  compat={},rocompat={},incompat={1=base v0.20,2=client
 writeable ranges,3=default file layouts on dirs,4=dir inode in
 separate object,5=mds uses versioned encoding,6=dirfrag is stored in
 omap,8=no anchor table}
 max_mds 1
 in  0
 up  {0=20284976}
 failed
 stopped
 data_pools  25
 metadata_pool   27
 inline_data disabled
 20285024:   10.5.38.2:7021/32024 'hobbit02' mds.-1.0 up:standby seq 1
 20346784:   10.5.38.1:6957/223554 'hobbit01' mds.-1.0 up:standby seq 1
 20284976:   10.5.38.13:6926/66700 'hobbit13' mds.0.1696 up:replay seq 1
 
 --
 Adam
 
 On Fri, May 22, 2015 at 11:37 AM, Lincoln Bryant linco...@uchicago.edu 
 wrote:
 I've experienced MDS issues in the past, but nothing sticks out to me in 
 your logs.
 
 Are you using a single active MDS with failover, or multiple active MDS?
 
 --Lincoln
 
 On May 22, 2015, at 10:10 AM, Adam Tygart wrote:
 
 Thanks for the quick response.
 
 I had 'debug mds = 20' in the first log, I added 'debug ms = 1' for this 
 one:
 https://drive.google.com/file/d/0B4XF1RWjuGh5bXFnRzE1SHF6blE/view?usp=sharing
 
 Based on these logs, it looks like heartbeat_map is_healthy 'MDS' just
 times out and then the mds gets respawned.
 
 --
 Adam
 
 On Fri, May 22, 2015 at 9:42 AM, Lincoln Bryant linco...@uchicago.edu 
 wrote:
 Hi Adam,
 
 You can get the MDS to spit out more debug information like so:
 
   # ceph mds tell 0 injectargs '--debug-mds 20 --debug-ms 1'
 
 At least then you can see where it's at when it crashes.
 
 --Lincoln
 
 On May 22, 2015, at 9:33 AM, Adam Tygart wrote:
 
 Hello all,
 
 The ceph-mds servers in our cluster are stuck in a constant
 boot-replay-crash cycle.

 I have enabled debug logging for the mds for a restart cycle on one of
 the nodes[1].
 
 Kernel debug from cephfs client during reconnection attempts:
 [732586.352173] ceph:  mdsc delayed_work
 [732586.352178] ceph:  check_delayed_caps
 [732586.352182] ceph:  lookup_mds_session 88202f01c000 210
 [732586.352185] ceph:  mdsc get_session 88202f01c000 210 - 211
 [732586.352189] ceph:  send_renew_caps ignoring mds0 (up:replay)
 [732586.352192] ceph:  add_cap_releases 88202f01c000 mds0 extra 680
 [732586.352195] ceph:  mdsc put_session 88202f01c000 211 - 210
 [732586.352198] ceph:  mdsc delayed_work
 [732586.352200] ceph:  check_delayed_caps
 [732586.352202] ceph:  lookup_mds_session 881036cbf800 1
 [732586.352205] ceph:  mdsc get_session 881036cbf800 1 - 2
 [732586.352207] ceph:  send_renew_caps ignoring mds0 (up:replay)
 [732586.352210] ceph:  add_cap_releases 881036cbf800 mds0 extra 680
 [732586.352212] ceph:  mdsc put_session 881036cbf800 2 - 1
 [732591.357123] ceph:  mdsc delayed_work
 [732591.357128] ceph:  check_delayed_caps
 [732591.357132] ceph:  lookup_mds_session 88202f01c000 210
 [732591.357135] ceph:  mdsc get_session 88202f01c000 210 - 211
 [732591.357139] ceph:  add_cap_releases 88202f01c000 mds0 extra 680
 [732591.357142] ceph:  mdsc put_session 88202f01c000 211 - 210
 [732591.357145] ceph:  mdsc delayed_work
 [732591.357147] ceph:  check_delayed_caps
 [732591.357149] ceph:  lookup_mds_session 881036cbf800 1
 [732591.357152] ceph:  mdsc get_session 881036cbf800 1 - 2
 [732591.357154] ceph:  add_cap_releases 881036cbf800 mds0 extra 680
 [732591.357157] ceph:  mdsc put_session 881036cbf800 2 - 1
 [732596.362076] ceph:  mdsc delayed_work
 [732596.362081] ceph:  check_delayed_caps
 [732596.362084] ceph:  lookup_mds_session 

Re: [ceph-users] Ceph MDS continually respawning (hammer)

2015-05-22 Thread Gregory Farnum
On Fri, May 22, 2015 at 11:34 AM, Adam Tygart mo...@ksu.edu wrote:
 On Fri, May 22, 2015 at 11:47 AM, John Spray john.sp...@redhat.com wrote:


 On 22/05/2015 15:33, Adam Tygart wrote:

 Hello all,

 The ceph-mds servers in our cluster are stuck in a constant
 boot-replay-crash cycle.

 I have enabled debug logging for the mds for a restart cycle on one of
 the nodes[1].


 You found a bug, or more correctly you probably found multiple bugs...

 It looks like your journal contains an EOpen event that lists 5307092 open
 files.  Because the MDS only drops its lock between events, not during
 processing a single one, this is causing the heartbeat map to think the MDS
 has locked up, so it's getting killed.

 So firstly we have to fix this to have appropriate calls into
 MDS::heartbeat_reset while iterating over lists of unbounded length in
 EMetablob::replay.  That would fix the false death of the MDS resulting from
 the heartbeat expiry.

 Secondly, this EOpen was a 2.6GB log event.  Something has almost certainly
 gone wrong when we see that data structure grow so large, so we should
 really be imposing some artificial cap there and catching the situation
 earlier, rather than journal this monster event and only hitting issues
 during replay.

 Thirdly, something is apparently leading the MDS to think that 5 million
 files were open in this particular log segment.  It seems like an improbable
 situation given that I can only see a single client in action here.  More
 investigation needed to see how this happened.  Can you describe the client
 workload that was going on in the run up to the system breaking?

 We are in the process of rsyncing 60+ million files from our old
 fileservers into the new cephfs stores. We are down to the last user,
 and it seems he has at least 25% of the total files. We were copying
 files at a rate of about 2 million a day before this crash.


 Anyway, actions:

 1. I'm assuming your metadata is not sensitive, as you have shared this
 debug log.  Please could you use cephfs-journal-tool journal export
 ~/journal.bin to grab an offline copy of the raw journal, in case we need
 to look at it later (this might take a while since your journal seems so
 large, but the resulting file should compress reasonably well with tar
 cSzf).

 Problematic. cephfs-journal-tool is giving me a stacktrace:
 http://people.beocat.cis.ksu.edu/~mozes/ceph/cephfs-journal-tool-stacktrace.txt

 2. optimistically, you may be able to get out of this situation by modifying
 the mds_beacon_grace config option on the MDS (set it to something high).
 This will cause the MDS to continue sending beacons to the mons, even when a
 thread is failing to yield promptly (as in this case), thereby preventing
 the mons from regarding the MDS as failed.  This hopefully will buy the MDS
 enough time to complete replay and come up, assuming it doesn't run out of
 memory in the process of dealing with whatever strangeness is in the
 journal.

 I can hold off on this for a bit if you have something else you'd like
 me to save while it is in its broken state.


Ugh. We appear to be trying to allocate too much memory for this event
in the journal dump; we'll need to fix this. :(
In the meantime, it's a harder workflow but if you can grab all the
objects in the metadata pool whose name starts with 200. that's the
raw data that the journal dumper is trying to grab and we can assemble
it ourselves for checking later on.

That said, you might actually have that many files still open in
some fashion from your clients. If you can dump caches on them
periodically, and/or limit the number of files that rsync is working
on at once, that should help keep this from happening again.
-Greg
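
For a kernel client, "dump caches" can be approximated by dropping the kernel's dentry/inode caches, which lets the client hand unused capabilities back to the MDS. A sketch, assuming the standard procfs knob and that the box can tolerate losing its caches:

   # on the cephfs client, as root:
   sync
   echo 2 > /proc/sys/vm/drop_caches    # 2 = free dentries and inodes

Splitting the rsync into smaller batches (for example with --files-from over chunks of the file list) is the other half of that suggestion.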


Re: [ceph-users] Ceph MDS continually respawning (hammer)

2015-05-22 Thread Adam Tygart
On Fri, May 22, 2015 at 11:47 AM, John Spray john.sp...@redhat.com wrote:


 On 22/05/2015 15:33, Adam Tygart wrote:

 Hello all,

 The ceph-mds servers in our cluster are stuck in a constant
 boot-replay-crash cycle.

 I have enabled debug logging for the mds for a restart cycle on one of
 the nodes[1].


 You found a bug, or more correctly you probably found multiple bugs...

 It looks like your journal contains an EOpen event that lists 5307092 open
 files.  Because the MDS only drops its lock between events, not during
 processing a single one, this is causing the heartbeat map to think the MDS
 has locked up, so it's getting killed.

 So firstly we have to fix this to have appropriate calls into
 MDS::heartbeat_reset while iterating over lists of unbounded length in
 EMetablob::replay.  That would fix the false death of the MDS resulting from
 the heartbeat expiry.

 Secondly, this EOpen was a 2.6GB log event.  Something has almost certainly
 gone wrong when we see that data structure grow so large, so we should
 really be imposing some artificial cap there and catching the situation
 earlier, rather than journal this monster event and only hitting issues
 during replay.

 Thirdly, something is apparently leading the MDS to think that 5 million
 files were open in this particular log segment.  It seems like an improbable
 situation given that I can only see a single client in action here.  More
 investigation needed to see how this happened.  Can you describe the client
 workload that was going on in the run up to the system breaking?

We are in the process of rsyncing 60+ million files from our old
fileservers into the new cephfs stores. We are down to the last user,
and it seems he has at least 25% of the total files. We were copying
files at a rate of about 2 million a day before this crash.


 Anyway, actions:

 1. I'm assuming your metadata is not sensitive, as you have shared this
 debug log.  Please could you use cephfs-journal-tool journal export
 ~/journal.bin to grab an offline copy of the raw journal, in case we need
 to look at it later (this might take a while since your journal seems so
 large, but the resulting file should compress reasonably well with tar
 cSzf).

Problematic. cephfs-journal-tool is giving me a stacktrace:
http://people.beocat.cis.ksu.edu/~mozes/ceph/cephfs-journal-tool-stacktrace.txt

 2. optimistically, you may be able to get out of this situation by modifying
 the mds_beacon_grace config option on the MDS (set it to something high).
 This will cause the MDS to continue sending beacons to the mons, even when a
 thread is failing to yield promptly (as in this case), thereby preventing
 the mons from regarding the MDS as failed.  This hopefully will buy the MDS
 enough time to complete replay and come up, assuming it doesn't run out of
 memory in the process of dealing with whatever strangeness is in the
 journal.

I can hold off on this for a bit if you have something else you'd like
me to save while it is in its broken state.

Thanks for the detailed writeup.
--
Adam

 3. If your MDS eventually makes it through recovery, unmount your client and
  use ceph daemon mds.<id> flush journal to flush and trim the journal: this
 should result in a situation where the next time the MDS starts, the
 oversized journal entries are no longer present and startup should go
 smoothly.

 Cheers,
 John


Re: [ceph-users] Ceph MDS continually respawning (hammer)

2015-05-22 Thread Adam Tygart
I believe I grabbed all of these files:

for x in $(rados -p metadata ls | grep -E '^200\.'); do rados -p
metadata get ${x} /tmp/metadata/${x}; done
tar czSf journal.tar.gz /tmp/metadata

https://drive.google.com/file/d/0B4XF1RWjuGh5MVFqVFZfNmpfQWc/view?usp=sharing
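
(If anyone needs to repeat this, a slightly more defensive version of the same loop -- the directory is just an example:)

   mkdir -p /tmp/metadata
   for x in $(rados -p metadata ls | grep -E '^200\.'); do
       rados -p metadata get "${x}" "/tmp/metadata/${x}" || echo "failed: ${x}" >&2
   done
   tar czSf journal.tar.gz -C /tmp metadata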

When this crash occurred, the rsync process should have been going
through all of this sequentially, one file at a time. There weren't
simultaneous rsyncs, and hadn't been for at least a couple days. I
would have hoped that if the files hadn't been touched in days that
*something* would have forcibly closed them and flushed them out of
the caches. I also would have thought that a close file/flush in rsync
(which I am sure it does, after finishing writing a file) would have
let them close in the cephfs session.

--
Adam

On Fri, May 22, 2015 at 2:06 PM, Gregory Farnum g...@gregs42.com wrote:
 On Fri, May 22, 2015 at 11:34 AM, Adam Tygart mo...@ksu.edu wrote:
 On Fri, May 22, 2015 at 11:47 AM, John Spray john.sp...@redhat.com wrote:


 On 22/05/2015 15:33, Adam Tygart wrote:

 Hello all,

 The ceph-mds servers in our cluster are stuck in a constant
 boot-replay-crash cycle.

 I have enabled debug logging for the mds for a restart cycle on one of
 the nodes[1].


 You found a bug, or more correctly you probably found multiple bugs...

 It looks like your journal contains an EOpen event that lists 5307092 open
 files.  Because the MDS only drops its lock between events, not during
 processing a single one, this is causing the heartbeat map to think the MDS
 has locked up, so it's getting killed.

 So firstly we have to fix this to have appropriate calls into
 MDS::heartbeat_reset while iterating over lists of unbounded length in
 EMetablob::replay.  That would fix the false death of the MDS resulting from
 the heartbeat expiry.

 Secondly, this EOpen was a 2.6GB log event.  Something has almost certainly
 gone wrong when we see that data structure grow so large, so we should
 really be imposing some artificial cap there and catching the situation
 earlier, rather than journal this monster event and only hitting issues
 during replay.

 Thirdly, something is apparently leading the MDS to think that 5 million
 files were open in this particular log segment.  It seems like an improbable
 situation given that I can only see a single client in action here.  More
 investigation needed to see how this happened.  Can you describe the client
 workload that was going on in the run up to the system breaking?

 We are in the process of rsyncing 60+ million files from our old
 fileservers into the new cephfs stores. We are down to the last user,
 and it seems he has at least 25% of the total files. We were copying
 files at a rate of about 2 million a day before this crash.


 Anyway, actions:

 1. I'm assuming your metadata is not sensitive, as you have shared this
 debug log.  Please could you use cephfs-journal-tool journal export
 ~/journal.bin to grab an offline copy of the raw journal, in case we need
 to look at it later (this might take a while since your journal seems so
 large, but the resulting file should compress reasonably well with tar
 cSzf).

 Problematic. cephfs-journal-tool is giving me a stacktrace:
 http://people.beocat.cis.ksu.edu/~mozes/ceph/cephfs-journal-tool-stacktrace.txt

 2. optimistically, you may be able to get out of this situation by modifying
 the mds_beacon_grace config option on the MDS (set it to something high).
 This will cause the MDS to continue sending beacons to the mons, even when a
 thread is failing to yield promptly (as in this case), thereby preventing
 the mons from regarding the MDS as failed.  This hopefully will buy the MDS
 enough time to complete replay and come up, assuming it doesn't run out of
 memory in the process of dealing with whatever strangeness is in the
 journal.

 I can hold off on this for a bit if you have something else you'd like
 me to save while it is in its broken state.


 Ugh. We appear to be trying to allocate too much memory for this event
 in the journal dump; we'll need to fix this. :(
 In the meantime, it's a harder workflow but if you can grab all the
 objects in the metadata pool whose name starts with 200. that's the
 raw data that the journal dumper is trying to grab and we can assemble
 it ourselves for checking later on.

 That said, you might actually have that many files still open in
 some fashion from your clients. If you can dump caches on them
 periodically, and/or limit the number of files that rsync is working
 on at once, that should help keep this from happening again.
 -Greg


Re: [ceph-users] Ceph MDS continually respawning (hammer)

2015-05-22 Thread Gregory Farnum
On Fri, May 22, 2015 at 12:34 PM, Adam Tygart mo...@ksu.edu wrote:
 I believe I grabbed all of these files:

 for x in $(rados -p metadata ls | grep -E '^200\.'); do rados -p
 metadata get ${x} /tmp/metadata/${x}; done
 tar czSf journal.tar.gz /tmp/metadata

 https://drive.google.com/file/d/0B4XF1RWjuGh5MVFqVFZfNmpfQWc/view?usp=sharing

 When this crash occurred, the rsync process should have been going
 through all of this sequentially, one file at a time. There weren't
 simultaneous rsyncs, and hadn't been for at least a couple days. I
 would have hoped that if the files hadn't been touched in days that
 *something* would have forcibly closed them and flushed them out of
 the caches. I also would have thought that a close file/flush in rsync
 (which I am sure it does, after finishing writing a file) would have
 let them close in the cephfs session.

This is trickier than it sounds with the kernel dcache, unfortunately.
We improved it a bit just last week but we'll have to try and diagnose
what happened in this case more before we can say if it was that issue
or something else.
-Greg


 --
 Adam

 On Fri, May 22, 2015 at 2:06 PM, Gregory Farnum g...@gregs42.com wrote:
 On Fri, May 22, 2015 at 11:34 AM, Adam Tygart mo...@ksu.edu wrote:
 On Fri, May 22, 2015 at 11:47 AM, John Spray john.sp...@redhat.com wrote:


 On 22/05/2015 15:33, Adam Tygart wrote:

 Hello all,

 The ceph-mds servers in our cluster are stuck in a constant
 boot-replay-crash cycle.

 I have enabled debug logging for the mds for a restart cycle on one of
 the nodes[1].


 You found a bug, or more correctly you probably found multiple bugs...

 It looks like your journal contains an EOpen event that lists 5307092 open
 files.  Because the MDS only drops its lock between events, not during
 processing a single one, this is causing the heartbeat map to think the MDS
 has locked up, so it's getting killed.

 So firstly we have to fix this to have appropriate calls into
 MDS::heartbeat_reset while iterating over lists of unbounded length in
 EMetablob::replay.  That would fix the false death of the MDS resulting 
 from
 the heartbeat expiry.

 Secondly, this EOpen was a 2.6GB log event.  Something has almost certainly
 gone wrong when we see that data structure grow so large, so we should
 really be imposing some artificial cap there and catching the situation
 earlier, rather than journal this monster event and only hitting issues
 during replay.

 Thirdly, something is apparently leading the MDS to think that 5 million
 files were open in this particular log segment.  It seems like an 
 improbable
 situation given that I can only see a single client in action here.  More
 investigation needed to see how this happened.  Can you describe the client
 workload that was going on in the run up to the system breaking?

 We are in the process of rsyncing 60+ million files from our old
 fileservers into the new cephfs stores. We are down to the last user,
 and it seems he has at least 25% of the total files. We were copying
 files at a rate of about 2 million a day before this crash.


 Anyway, actions:

 1. I'm assuming your metadata is not sensitive, as you have shared this
 debug log.  Please could you use cephfs-journal-tool journal export
 ~/journal.bin to grab an offline copy of the raw journal, in case we need
 to look at it later (this might take a while since your journal seems so
 large, but the resulting file should compress reasonably well with tar
 cSzf).

 Problematic. cephfs-journal-tool is giving me a stacktrace:
 http://people.beocat.cis.ksu.edu/~mozes/ceph/cephfs-journal-tool-stacktrace.txt

 2. optimistically, you may be able to get out of this situation by 
 modifying
 the mds_beacon_grace config option on the MDS (set it to something high).
 This will cause the MDS to continue sending beacons to the mons, even when 
 a
 thread is failing to yield promptly (as in this case), thereby preventing
 the mons from regarding the MDS as failed.  This hopefully will buy the MDS
 enough time to complete replay and come up, assuming it doesn't run out of
 memory in the process of dealing with whatever strangeness is in the
 journal.

 I can hold off on this for a bit if you have something else you'd like
 me to save while it is in its broken state.


 Ugh. We appear to be trying to allocate too much memory for this event
 in the journal dump; we'll need to fix this. :(
 In the meantime, it's a harder workflow but if you can grab all the
 objects in the metadata pool whose name starts with 200. that's the
 raw data that the journal dumper is trying to grab and we can assemble
 it ourselves for checking later on.

 That said, you might actually have that many files still open in
 some fashion from your clients. If you can dump caches on them
 periodically, and/or limit the number of files that rsync is working
 on at once, that should help keep this from happening again.
 -Greg