So, I lost one of my servers and the OS was reinstalled.  The gluster data is 
on another disk that survives OS reinstalls.  /var/lib/glusterd, however, does not.

I was following the directions for bringing it back up, but before I did that, 
I think a peer probe was done with the new UUID.  This caused it to be dropped 
from the cluster entirely.

I edited the UUID back to what it was, but the node is still not in the 
cluster.  The web site didn’t seem to have any help on how to undo the drop.  
It was part of a replica 2 pair, and I would merely like it to come up and be 
a part of the cluster again.  It has all the data (I run with quorum, so all 
the replica 2 pair contents are R/O until this server comes back).  I don’t 
mind letting it refresh from the other member of the replica pair, even though 
the data is already on disk.
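For reference, here is a sketch of the UUID edit described above.  This is a 
guess at the mechanics, not an official procedure: it assumes the stock 
/var/lib/glusterd layout, the UUIDs are made-up placeholders, and it operates 
on a scratch copy of glusterd.info rather than the live file (on the real node 
you would stop glusterd first, edit /var/lib/glusterd/glusterd.info, and start 
it again).

```shell
# Scratch stand-in for /var/lib/glusterd on the reinstalled node.
GLUSTERD_DIR=$(mktemp -d)
printf 'UUID=%s\noperating-version=30712\n' \
    "aaaaaaaa-0000-4000-8000-000000000000" > "$GLUSTERD_DIR/glusterd.info"

# The old UUID is still on file with the healthy peers: on a good peer,
# grep for "uuid=" in the files under /var/lib/glusterd/peers/.
OLD_UUID="bbbbbbbb-1111-4111-8111-111111111111"   # placeholder

# Put the old UUID back so the node identifies itself as it used to.
sed -i "s/^UUID=.*/UUID=${OLD_UUID}/" "$GLUSTERD_DIR/glusterd.info"
grep '^UUID=' "$GLUSTERD_DIR/glusterd.info"
```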

I tried:

# gluster volume replace-brick g2home machine04:/.g/g2home 
machine04:/.g/g2home-new commit force
volume replace-brick: failed: Host machine04 is not in 'Peer in Cluster' state

to try and let it resync into the cluster, but it won’t let me replace the 
brick.  I can’t do:

# gluster peer detach machine04
peer detach: failed: Brick(s) with the peer machine04 exist in cluster

either.  What I wanted it to do is this: when it connected to the cluster for 
the first time with the new UUID, the cluster should inform it that it might 
have filesystems on it (it comes in with a name already in the peer list), and 
it should get brick information from the cluster and check it out.  If it has 
those bricks, it should just notice the UUID is wrong, fix it, make itself 
part of the cluster again, spin it all up, and continue on.

I tried:

# gluster volume add-brick g2home replica 2 machine04:/.g/g2home-new
volume add-brick: failed: Volume g2home does not exist

and it didn’t work on machine04, nor on one of the peers:

# gluster volume add-brick g2home replica 2 machine04:/.g/g2home-new
volume add-brick: failed: Operation failed

So, to try and fix the Peer in Cluster issue, I stopped and restarted glusterd 
many times, and eventually almost all of the peers resynced and came up into 
the Peer in Cluster state.  All except for one that was endlessly confused.  
So, if the network works, it should wipe the peer state and just retry the 
entire state machine to get back into the right state.  I had to stop glusterd 
on the two machines and then manually edit the state to be 3, and then restart 
them.  It then at least showed the right state on both.
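The manual state edit can be sketched like this.  Again, this is a guess at 
the mechanics, demonstrated on a scratch copy of a peer file: the real files 
live under /var/lib/glusterd/peers/, one per peer, and glusterd must be 
stopped on both machines while you edit them.  The UUID, hostname, and 
starting state below are placeholders.

```shell
# Scratch stand-in for /var/lib/glusterd/peers on one of the stuck machines.
PEERS_DIR=$(mktemp -d)
cat > "$PEERS_DIR/bbbbbbbb-1111-4111-8111-111111111111" <<'EOF'
uuid=bbbbbbbb-1111-4111-8111-111111111111
state=5
hostname1=machine04
EOF

# state=3 is the "Peer in Cluster" state; force the stuck peer into it.
sed -i 's/^state=.*/state=3/' "$PEERS_DIR"/*
grep '^state=' "$PEERS_DIR"/*
```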

Next, let’s try and sync up the bricks:

root@machine04:/# gluster volume sync machine00 all
Sync volume may make data inaccessible while the sync is in progress. Do you 
want to continue? (y/n) y
volume sync: success
root@machine04:/# gluster vol info
No volumes present

root@machine02:/# gluster volume heal g2home full
Staging failed on machine04. Error: Volume g2home does not exist

Think about that.  This is a replica 2 server, the entire point would be to fix 
up the array if one of the machines was screwy.  heal seemed like the command 
to fix it up.

So, now that it is connected, let’s try this again:

# gluster volume replace-brick g2home machine04:/.g/g2home 
machine04:/.g/g2home-new commit force
volume replace-brick: failed: Pre Validation failed on machine04. volume: 
g2home does not exist

Nope, that won’t work.  So, let’s try removing:

# gluster vol remove-brick g2home replica 2 machine04:/.g/g2home 
machine05:/.g/g2home start
volume remove-brick start: failed: Staging failed on machine04. Please check 
log file for details.

Nope, that won’t work either.  What’s the point of remove, if it won’t work?

Ok, fine, let’s go for a bigger hammer:

# gluster peer detach machine04 force
peer detach: failed: Brick(s) with the peer machine04 exist in cluster

Doh.  I know that, but it is a replica!

[ more googling ]

Someone said to just copy the entire vols directory over from a good peer.  
[ cross fingers ] Copy vols.

Ok, I can now do a gluster volume status g2home detail, which I could not 
before.  Files seem to be R/W on the array now.  I think that might have worked.
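For the record, here is roughly what the vols copy amounts to.  This is a 
sketch, not a verified procedure: it assumes stock paths, machine00 stands in 
for any healthy peer, the live commands are shown only as comments, and the 
copy itself is demonstrated between scratch directories with a fake volume 
definition.

```shell
# Scratch stand-ins for /var/lib/glusterd on a healthy peer and the broken node.
HEALTHY=$(mktemp -d)
BROKEN=$(mktemp -d)
mkdir -p "$HEALTHY/vols/g2home"
echo 'type=2' > "$HEALTHY/vols/g2home/info"    # fake volume definition

# On the real machines this would be something like:
#   systemctl stop glusterd                                   # on machine04
#   rsync -a machine00:/var/lib/glusterd/vols/ /var/lib/glusterd/vols/
#   systemctl start glusterd
#   gluster volume status g2home detail
cp -a "$HEALTHY/vols" "$BROKEN/"

ls "$BROKEN/vols"    # the copied volume definition: g2home
```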

So, why can’t gluster copy vols by itself, if indeed that is the right thing to 
do?

Why can’t the documentation just say: edit the state variable and copy vols to 
get it going again?

Why can’t probe figure out that you were already part of a cluster?  When it 
runs, it could notice that your brains have been wiped, grab that info from 
the cluster, and bring the node back up.  It could even run heal on the data 
to ensure that nothing messed with it and that it matches the other replica.
_______________________________________________
Gluster-users mailing list
[email protected]
http://www.gluster.org/mailman/listinfo/gluster-users
