Re: [ceph-users] Multi-MDS Failover
On 19 May 2018 at 09:20, Scottix wrote:
> It would be nice to have an option to have all IO blocked if it hits a
> degraded state until it recovers. Since you are unaware of other MDS state,
> seems like that would be tough to do.

I agree this would be a nice knob to have, from the perspective of having consistent (and easy to diagnose) client behaviour when such a situation occurs. However, I don't think this is possible: if a client is working in a directory served via the rank-0 MDS (whilst rank 1 has just gone down), it isn't going to know rank 1 is down until the MONs do. So to get the "all stop" you are talking about, the client would then have to undo already-committed IO(!); the only other option would be "pinging" all ranks on every metadata change, and that sounds horrible.

Maybe this is a case where you'd be better off putting NFS in front of your CephFS?

--
Cheers,
~Blairo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Multi-MDS Failover
So we have been testing this quite a bit. Having the failure domain be partially available is OK for us, but odd, since we don't know what will be down; compared to a single MDS, where we know everything will be blocked.

It would be nice to have an option to have all IO blocked if it hits a degraded state until it recovers. Since you are unaware of other MDS state, seems like that would be tough to do. I'll leave this as a feature request, possibly for the future.

On Fri, May 18, 2018 at 3:15 PM Gregory Farnum wrote:
> On Fri, May 18, 2018 at 11:56 AM Webert de Souza Lima wrote:
>> On Mon, Apr 30, 2018 at 7:16 AM Daniel Baumann wrote:
>>> additionally: if rank 0 is lost, the whole FS stands still (no new
>>> client can mount the fs; no existing client can change a directory,
>>> etc.).
>>
>> Could someone confirm if this is actually how it works? Thanks.
>
> Yes, although I'd expect that clients can keep doing work in directories
> they've already got opened (or in descendants of those). Perhaps I'm
> missing something about that, though...
> -Greg
Re: [ceph-users] Multi-MDS Failover
On Fri, May 18, 2018 at 11:56 AM Webert de Souza Lima wrote:
> On Mon, Apr 30, 2018 at 7:16 AM Daniel Baumann wrote:
>> additionally: if rank 0 is lost, the whole FS stands still (no new
>> client can mount the fs; no existing client can change a directory, etc.).
>>
>> my guess is that the root of a cephfs (/; which is always served by rank
>> 0) is needed in order to do traversals/lookups of any directories on the
>> top-level (which then can be served by ranks 1-n).
>
> Could someone confirm if this is actually how it works? Thanks.

Yes, although I'd expect that clients can keep doing work in directories they've already got opened (or in descendants of those). Perhaps I'm missing something about that, though...
-Greg
Re: [ceph-users] Multi-MDS Failover
Hello,

On Mon, Apr 30, 2018 at 7:16 AM Daniel Baumann wrote:
> additionally: if rank 0 is lost, the whole FS stands still (no new
> client can mount the fs; no existing client can change a directory, etc.).
>
> my guess is that the root of a cephfs (/; which is always served by rank
> 0) is needed in order to do traversals/lookups of any directories on the
> top-level (which then can be served by ranks 1-n).

Could someone confirm if this is actually how it works? Thanks.

Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*
*IRC NICK - WebertRLZ*
Re: [ceph-users] Multi-MDS Failover
On 04/27/2018 07:11 PM, Patrick Donnelly wrote:
> The answer is that there may be partial availability from
> the up:active ranks which may hand out capabilities for the subtrees
> they manage or no availability if that's not possible because it
> cannot obtain the necessary locks.

additionally: if rank 0 is lost, the whole FS stands still (no new client can mount the fs; no existing client can change a directory, etc.).

my guess is that the root of a cephfs (/; which is always served by rank 0) is needed in order to do traversals/lookups of any directories on the top-level (which then can be served by ranks 1-n).

last year, we had quite some trouble with an unstable cephfs (MDS reliably and reproducibly crashing when hitting them with rsync over multi-TB directories with files all being <<1mb) and had lots of situations where ranks (most of the time including 0) were down. fortunately, we could always get the fs back by unmounting it on all clients and restarting all MDS daemons. the last of these instabilities seems to have gone with 12.2.3/12.2.4 (we're now running 12.2.5).

Regards,
Daniel
Re: [ceph-users] Multi-MDS Failover
On Thu, Apr 26, 2018 at 7:04 PM, Scottix wrote:
> Basically I need to answer the question; what happens when 1 of
> 2 multi_mds fails with no standbys ready to come save them?
> [...]
> How, why and what is affected are very relevant questions if this
> is what the failure looks like since it is not 100% blocking.

Okay, so now I understand what your real question is: what is the state of CephFS when one or more ranks have failed but no standbys exist to take over?

The answer is that there may be partial availability from the up:active ranks, which may hand out capabilities for the subtrees they manage, or no availability if that's not possible because it cannot obtain the necessary locks. No metadata is lost. No inconsistency is created between clients. Full availability will be restored when the lost ranks come back online.

--
Patrick Donnelly
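Patrick's answer can be pictured with a toy model (this is an illustration, not CephFS code): each subtree has one authoritative rank, and operations in a subtree can only proceed while that rank is up:active. It deliberately ignores locking and already-held capabilities, which is exactly the part that "depends":

```python
def available(path: str, subtree_auth: dict, up_ranks: set) -> bool:
    """Toy model: a path is (potentially) available iff the rank
    authoritative for its longest matching subtree prefix is still up.
    Real CephFS availability also depends on locks and caps held."""
    # Find the most specific subtree prefix that covers the path.
    best = max(
        (p for p in subtree_auth
         if path == p or path.startswith(p.rstrip("/") + "/")),
        key=len,
    )
    return subtree_auth[best] in up_ranks

# Rank 0 serves "/", rank 1 serves "/home" -- and rank 1 has crashed:
auth = {"/": 0, "/home": 1}
print(available("/var/log", auth, up_ranks={0}))     # True  (rank 0 subtree)
print(available("/home/alice", auth, up_ranks={0}))  # False (rank 1 subtree)
```

This matches the observed behaviour in the thread: some directories keep working while others block, depending on which rank was authoritative for them.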
Re: [ceph-users] Multi-MDS Failover
Ok, let me try to explain this better; we are doing this back and forth and it's not going anywhere. I'll just be as genuine as I can and explain the issue.

What we are testing is a critical failure scenario, and actually more of a real-world scenario: basically, what happens when it is 1AM and the shit hits the fan, half of your servers are down, and 1 of the 3 MDS boxes is still alive.

There is one very important fact about CephFS when the single active MDS server fails: it is guaranteed that 100% of all IO is blocked. No split-brain, no corrupted data; 100% guaranteed, ever since we started using CephFS.

Now with multi_mds, I understand this changes the logic, and I understand how difficult and how hard this problem is; trust me, I would not be able to tackle this. Basically I need to answer the question: what happens when 1 of 2 multi_mds fails with no standbys ready to come save them?

What I have tested is not the same as a single active MDS; this absolutely changes the logic of what happens and how we troubleshoot. The CephFS is still alive, and it does allow operations and does allow resources to go through. How, why, and what is affected are very relevant questions if this is what the failure looks like, since it is not 100% blocking.

This is the problem: I have programs writing a massive amount of data, and I don't want it corrupted or lost. I need to know what happens, and I need to have guarantees.

Best

On Thu, Apr 26, 2018 at 5:03 PM Patrick Donnelly wrote:
> If a rank is laggy/crashed, the file system as a whole is generally
> unavailable. The span between partial outage and full is small and not
> worth quantifying.
> [...]
> There's nothing to enforce. A warning is sufficient for the operator
> that (a) they didn't configure any standbys or (b) MDS daemon
> processes/boxes are going away and not coming back as standbys (i.e.
> the pool of MDS daemons is decreasing with each failover)
Re: [ceph-users] Multi-MDS Failover
On Thu, Apr 26, 2018 at 4:40 PM, Scottix wrote:
>> Of course -- the mons can't tell the difference!
> That is really unfortunate, it would be nice to know if the filesystem has
> been degraded and to what degree.

If a rank is laggy/crashed, the file system as a whole is generally unavailable. The span between partial outage and full is small and not worth quantifying.

>> You must have standbys for high availability. This is in the docs.
> Ok but what if you have your standby go down and a master go down. This
> could happen in the real world and is a valid error scenario. Also there is
> a period between when the standby becomes active; what happens in-between
> that time?

The standby MDS goes through a series of states where it recovers the lost state and connections with clients. Finally, it goes active.

>> It depends(tm) on how the metadata is distributed and what locks are
>> held by each MDS.
> You're saying that depending on which mds had a lock on a resource, it will
> block that particular POSIX operation? Can you clarify a little bit?
>
>> Standbys are not optional in any production cluster.
> Of course in production I would hope people have standbys, but in theory
> there is no enforcement in Ceph for this other than a warning. So when you
> say not optional, that is not exactly true; it will still run.

It's self-defeating to expect CephFS to enforce having standbys -- presumably by throwing an error or becoming unavailable -- when the standbys exist to make the system available.

There's nothing to enforce. A warning is sufficient for the operator that (a) they didn't configure any standbys or (b) MDS daemon processes/boxes are going away and not coming back as standbys (i.e. the pool of MDS daemons is decreasing with each failover).

--
Patrick Donnelly
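The "series of states" Patrick mentions is the documented MDS recovery sequence. A small sketch of that progression (state names are from the CephFS documentation; the exact path can include further states such as up:clientreplay, so treat this as the common case only):

```python
# Ordered states a replacement MDS passes through when taking over a
# failed rank, per the CephFS docs. up:resolve only occurs in multi-MDS
# file systems, where ranks must agree on subtree ownership first.
TAKEOVER_PATH = [
    "up:replay",     # replay the failed rank's journal
    "up:resolve",    # multi-MDS only: settle subtree ownership with peers
    "up:reconnect",  # re-establish client sessions and capabilities
    "up:rejoin",     # rejoin the distributed cache and lock state
    "up:active",     # serving metadata again
]

def states_until_active(multi_mds: bool) -> list:
    """Return the recovery states in order; single-MDS skips up:resolve."""
    return [s for s in TAKEOVER_PATH if multi_mds or s != "up:resolve"]

print(states_until_active(multi_mds=True))
```

Until the daemon reaches up:active, client IO against that rank's metadata remains blocked, which is the "in-between" period asked about above.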
Re: [ceph-users] Multi-MDS Failover
> Of course -- the mons can't tell the difference!

That is really unfortunate; it would be nice to know if the filesystem has been degraded and to what degree.

> You must have standbys for high availability. This is the docs.

Ok, but what if you have your standby go down and a master go down? This could happen in the real world and is a valid error scenario. Also, there is a period between when the standby becomes active; what happens in-between that time?

> It depends(tm) on how the metadata is distributed and what locks are
> held by each MDS.

You're saying that depending on which mds had a lock on a resource, it will block that particular POSIX operation? Can you clarify a little bit?

> Standbys are not optional in any production cluster.

Of course in production I would hope people have standbys, but in theory there is no enforcement in Ceph for this other than a warning. So when you say not optional, that is not exactly true; it will still run.

On Thu, Apr 26, 2018 at 3:37 PM Patrick Donnelly wrote:
> On Thu, Apr 26, 2018 at 3:16 PM, Scottix wrote:
> [...]
> It depends(tm) on how the metadata is distributed and what locks are
> held by each MDS.
>
> Standbys are not optional in any production cluster.
Re: [ceph-users] Multi-MDS Failover
On Thu, Apr 26, 2018 at 3:16 PM, Scottix wrote:
> Updated to 12.2.5
>
> We are starting to test multi_mds cephfs and we are going through some
> failure scenarios in our test cluster.
>
> We are simulating a power failure to one machine and we are getting mixed
> results of what happens to the file system.
>
> This is the status of the mds once we simulate the power loss, considering
> there are no more standbys.
>
> mds: cephfs-2/2/2 up
> {0=CephDeploy100=up:active,1=TigoMDS100=up:active(laggy or crashed)}
>
> 1. It is a little unclear if it is laggy or really is down, using this line
> alone.

Of course -- the mons can't tell the difference!

> 2. The first time we lost total access to ceph folder and just blocked i/o

You must have standbys for high availability. This is in the docs.

> 3. One time we were still able to access ceph folder and everything seems
> to be running.

It depends(tm) on how the metadata is distributed and what locks are held by each MDS.

Standbys are not optional in any production cluster.

--
Patrick Donnelly
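For reference, ensuring standbys exist (and getting warned when the pool of standbys runs dry) on a Luminous-era cluster looks roughly like the following; the file system name `cephfs` is an assumption, and the commands are a sketch of the usual procedure rather than a complete runbook:

```shell
# Two active ranks, as in the thread's test cluster:
ceph fs set cephfs max_mds 2

# Ask the mons to raise a health warning (MDS_INSUFFICIENT_STANDBY)
# whenever fewer than one standby daemon is available:
ceph fs set cephfs standby_count_wanted 1

# Then run more ceph-mds daemons than there are ranks; any daemon not
# holding a rank registers itself as a standby automatically.
```

This doesn't enforce anything, per Patrick's point above, but it turns "no standbys left" into a visible health warning instead of a silent condition.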
[ceph-users] Multi-MDS Failover
Updated to 12.2.5

We are starting to test multi_mds cephfs and we are going through some failure scenarios in our test cluster.

We are simulating a power failure to one machine and we are getting mixed results of what happens to the file system.

This is the status of the mds once we simulate the power loss, considering there are no more standbys:

mds: cephfs-2/2/2 up {0=CephDeploy100=up:active,1=TigoMDS100=up:active(laggy or crashed)}

1. It is a little unclear if it is laggy or really is down, using this line alone.
2. The first time, we lost total access to the ceph folder and i/o just blocked.
3. One time, we were still able to access the ceph folder and everything seemed to be running.
4. One time, we had a script creating a bunch of files; we simulated the crash, then listed the directory and it showed 0 files, when the expected result was lots of files.

I mean, we could go into details on each of those, but really I am trying to understand ceph's logic in dealing with a crashed multi mds, or whether you mark it degraded, or what is going on. It just seems a little unclear what is going to happen.

Good news: once it comes back online, everything is as it should be.

Thanks
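A status line like the one above can also be checked programmatically. A minimal sketch; it assumes the exact `{rank=daemon=state}` layout shown in the post, which can vary between releases:

```python
def parse_mds_map(status: str) -> dict:
    """Parse the '{0=name=up:active,...}' portion of an 'mds:' status
    line into {rank: (daemon_name, state)}."""
    body = status[status.index("{") + 1 : status.rindex("}")]
    ranks = {}
    for entry in body.split(","):
        rank, name, state = entry.split("=", 2)
        ranks[int(rank)] = (name, state)
    return ranks

def degraded_ranks(status: str) -> list:
    """Ranks whose daemon is not cleanly up:active (e.g. laggy/crashed)."""
    return [r for r, (_, state) in parse_mds_map(status).items()
            if state != "up:active"]

line = ("mds: cephfs-2/2/2 up "
        "{0=CephDeploy100=up:active,"
        "1=TigoMDS100=up:active(laggy or crashed)}")
print(degraded_ranks(line))  # [1] -- rank 1 is laggy or crashed
```

As noted in the replies, the map alone cannot distinguish laggy from dead; this only tells you which rank the mons consider suspect.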