I guess I just see healing differently than the development team does.
I hope you will consider adding an option that will allow switching between the
current auto self-healing and a new manual self-healing I will outline below.
How I see "manual self-healing" working:
I look at healing similarly to the way that mdadm looks at disk drives in RAID 5.
1. I think when a server goes down it should be flagged as "faulty" and notifications should be sent out (in this case to all the clients, telling them not to use the downed server [I hope this removes the small hang delay for clients 2-N], as well as via e-mail to the sysadmin).
2. The down server is then taken "out of service" but the rest of the array continues in
"degraded mode".
3. Then when the down server comes back and starts glusterfsd it remains
"faulty" and no client can use it.
4. A sysadmin comes in and "fixes" the problem that took the server down in the
first place.
5. A sysadmin changes the "faulty flag" to a "resync flag" ("resync flag" tells
the clients to write to the machine, but not read from it while it recovers).
6. A sysadmin then runs a re-sync (ls -alR).
7. Once the re-sync completes a sysadmin runs a "re-add" command removing the
"faulty flag" and the clients can begin using the server again.
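To make the proposal concrete, steps 1-7 above amount to a small state machine. Here is a toy sketch of it; the state names, method names, and the idea that clients consult a flag are purely my own illustration, not anything that exists in glusterfs today:

```python
# Conceptual sketch of the proposed manual self-healing states.
# Nothing here is real glusterfs code; it only illustrates the
# active -> faulty -> resync -> active lifecycle from steps 1-7.

ACTIVE, FAULTY, RESYNC = "active", "faulty", "resync"

class Server:
    def __init__(self, name):
        self.name = name
        self.state = ACTIVE

    def mark_faulty(self):
        # Steps 1-3: server goes down, gets flagged, stays flagged
        # even after glusterfsd restarts.
        self.state = FAULTY

    def start_resync(self):
        # Step 5: sysadmin flips the faulty flag to a resync flag.
        if self.state != FAULTY:
            raise RuntimeError("can only resync a faulty server")
        self.state = RESYNC

    def readd(self):
        # Step 7: re-sync finished, sysadmin runs "re-add".
        if self.state != RESYNC:
            raise RuntimeError("can only re-add after a resync")
        self.state = ACTIVE

    # What clients may do in each state: no access while faulty;
    # writes only (so the server catches up) while resyncing.
    def client_can_read(self):
        return self.state == ACTIVE

    def client_can_write(self):
        return self.state in (ACTIVE, RESYNC)
```

The key property is that there is no path from FAULTY back to ACTIVE without a sysadmin passing through RESYNC, which is exactly the guarantee the automatic healing doesn't give.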
I feel that this method removes the chance that a server goes down, gets out of sync, recovers on its own (or through automated tools), and starts providing services with some old data.
If the server goes down in the middle of the night and nagios trips a reboot, the server comes back up with no sysadmin logged in to run the "ls -alR" that would re-sync it.
Tejas N. Bhise wrote:
Ed, Chad, Stephen,
We believe we have fixed all (known) problems with self-heal
in the latest releases and hence we would be very interested in
getting diagnostics if you can reproduce the problem or see it
again frequently.
Please collect the logs by running the client and servers with
log level TRACE and then reproducing the problem.
Also collect the backend extended attributes of the file on
both servers before self-heal was triggered. This command can be
used to get that info:
# getfattr -d -m '.*' -e hex <filename>
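In case it helps interpret that output: as far as I understand it, each trusted.afr.* value is 12 bytes of pending-operation counters (data, metadata, entry changelogs), each a 32-bit big-endian integer. A small parser sketch, assuming that layout (the example hex value is made up):

```python
# Parse an AFR changelog xattr value as printed by
#   getfattr -d -m '.*' -e hex <filename>
# Assumed layout: three 32-bit big-endian counters in the first
# 12 bytes: data, metadata, entry changelogs.
import struct

def parse_afr_changelog(hex_value):
    raw = bytes.fromhex(hex_value.removeprefix("0x"))
    data, metadata, entry = struct.unpack(">III", raw[:12])
    return {"data": data, "metadata": metadata, "entry": entry}

# A non-zero counter means this copy has pending operations against
# the other brick, i.e. it believes the remote copy is out of date.
print(parse_afr_changelog("0x000000020000000000000000"))
# -> {'data': 2, 'metadata': 0, 'entry': 0}
```

Comparing these counters from both servers before self-heal runs is exactly the diagnostic info Tejas is asking for.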
Thank you for your help in debugging if any new problem shows up.
Feel free to ask if you have any queries about this.
Regards,
Tejas.
----- Original Message -----
From: "Ed W" <[email protected]>
To: "Gluster Users" <[email protected]>
Sent: Monday, March 8, 2010 5:22:40 PM GMT +05:30 Chennai, Kolkata, Mumbai, New Delhi
Subject: Re: [Gluster-users] How to re-sync
On 07/03/2010 16:02, Chad wrote:
Is there a gluster developer out there working on this problem
specifically?
Could we add some kind of "sync done" command that has to be run
manually and until it is the failed node is not used?
The bottom line for me is that I would much rather run on a
performance-degraded array until a sysadmin intervenes than lose any
data.
I'm only in evaluation mode at the moment, but resolving split brain is
something which is terrifying me right now, and I have been giving
some thought to how it needs to be done with various solutions.
In the case of gluster it really does seem very important to figure out
a reliable way to know when the system is fully synced again after an
outage. For example, a not-unrealistic situation if you were doing a
bunch of upgrades would be:
- Turn off server 1 (S1) and upgrade; server 2 (S2) deviates from S1
- Turn on server 1 and expect it to sync all the changes made while it
was down - the key expectation here is that S1 only receives changes
from S2 and never sends its own.
- Some event marks sync complete so that we can turn off S2 and upgrade it
The problem, if you don't do the sync, is that you turn off S2
and now S1 doesn't know about the changes made while it was off and
serves up incomplete information. Split brain can occur where a file is
changed on both servers while they couldn't talk to each other, and
then changes must be lost...
I suppose a really cool translator could be written to track changes
made to an AFR group while one member is missing; the out-of-sync file
list would then be resupplied once that member was turned on again, in
order to speed up replication... Kind of a lot of work for a small
improvement, but could be interesting to create...
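The tracking idea could look something like this toy sketch; it has nothing to do with gluster's actual translator API, and all names are invented:

```python
# Toy sketch of a "dirty list" for an AFR group: while a replica is
# down, record which paths were written, so that only those paths need
# re-syncing when it returns, instead of a full crawl (ls -alR).

class DirtyTracker:
    def __init__(self):
        self.peer_up = True
        self.dirty = set()   # paths changed while the peer was absent

    def peer_down(self):
        self.peer_up = False

    def write(self, path):
        # Called on every write; only logged while the peer is absent.
        if not self.peer_up:
            self.dirty.add(path)

    def peer_returned(self):
        # Hand back the out-of-sync list and reset for the next outage.
        self.peer_up = True
        todo, self.dirty = self.dirty, set()
        return todo
```

The win is that re-sync cost becomes proportional to the number of files changed during the outage rather than to the size of the whole volume.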
Perhaps some dev has some other suggestions on a "procedure" to follow
to avoid split brain in the situation that we need to turn off all
servers one by one in an AFR group?
Thanks
Ed W
_______________________________________________
Gluster-users mailing list
[email protected]
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users