Hi All,

Current Design and its limitations:

        Geo-replication syncs changes across geographies using changelogs captured
  by the changelog translator, which sits on the server side just above the posix
  translator. Hence, in a distributed replicated setup, both replica pairs collect
  changelogs w.r.t. their bricks. Geo-replication syncs the changes using only one
  brick of the replica pair at a time, calling it "ACTIVE" and the other,
  non-syncing brick "PASSIVE".
   
         Let's consider the below example of a distributed replicated setup,
         where NODE-1 has brick b1 and its replica b1r is on NODE-2:

                        NODE-1                         NODE-2
                          b1                            b1r

  At the beginning, geo-replication chooses to sync changes from NODE-1:b1, and
  NODE-2:b1r will be "PASSIVE". The logic depends on the virtual getxattr
  'trusted.glusterfs.node-uuid', which always returns the first up subvolume,
  i.e., NODE-1. When NODE-1 goes down, the above xattr returns NODE-2, which is
  then made 'ACTIVE'. But when NODE-1 comes back, the xattr returns NODE-1 again
  and it is made 'ACTIVE'. So if NODE-2 has not finished processing its
  changelog, both NODE-2 and NODE-1 will be ACTIVE for a brief interval of time,
  causing the rename race below.
   
   https://bugzilla.redhat.com/show_bug.cgi?id=1140183
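The flip-flop can be seen with a toy model (plain Python; this is not actual gsyncd code, and names like active_workers are illustrative):

```python
def first_up_node(nodes_up, order=("NODE-1", "NODE-2")):
    """Model of 'trusted.glusterfs.node-uuid': returns the first *up*
    subvolume in a fixed order."""
    for node in order:
        if node in nodes_up:
            return node
    return None

def active_workers(nodes_up, node2_done_processing):
    """Model of the race: the xattr flips back to NODE-1 as soon as it is
    up, but NODE-2, made ACTIVE while NODE-1 was down, keeps syncing until
    it finishes processing its changelog."""
    active = set()
    chosen = first_up_node(nodes_up)
    if chosen:
        active.add(chosen)
    # NODE-2 is still draining its changelog from its ACTIVE stint.
    if "NODE-2" in nodes_up and not node2_done_processing:
        active.add("NODE-2")
    return active
```

With NODE-1 freshly back and NODE-2 still mid-changelog, both end up ACTIVE, which is exactly the window in which the rename race occurs.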


SOLUTION:
   Don't make NODE-2 'PASSIVE' when NODE-1 comes back; keep NODE-2 'ACTIVE'
   until NODE-2 itself goes down.


APPROACH I CAN THINK OF TO SOLVE THIS:

Maintain a shared file that records which bricks are active. When a node goes
down, the file is updated with its replica bricks, making sure that at any
point in time the file lists all the bricks that should be active. A
geo-replication worker process is made 'ACTIVE' only if its brick is in the file.
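As an illustration, the worker-side check could look something like this (the file format of one 'NodeUUID:brickpath' entry per line and the helper names are assumptions, not actual gsyncd code):

```python
def read_active_bricks(path):
    """Read the shared active-bricks file, one 'NodeUUID:brickpath'
    entry per line (assumed format)."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def should_be_active(node_uuid, brick_path, active_file):
    """A worker goes ACTIVE only if its own entry is in the file."""
    return "%s:%s" % (node_uuid, brick_path) in read_active_bricks(active_file)
```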

 Implementation can be done in two ways:

  1. Use a distributed store. This needs more thought, as a distributed store
     is not in place in glusterd yet.

  2. Store the list in a file similar to the existing glusterd global
     configuration file (/var/lib/glusterd/options). Whenever this file is
     updated, its version number is incremented. When a node that went down
     comes back up, it gets this file from its peers if its version number is
     lower than that of the peers.
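The versioned reconciliation in option 2 could be sketched as follows (field names are illustrative; this mirrors the idea of glusterd's global-option versioning rather than its actual code):

```python
def update_active_bricks(opts, bricks):
    """Rewrite the active-bricks entry and bump the version, as glusterd
    does for global options on every update."""
    opts = dict(opts)
    opts["active-bricks"] = ",".join(sorted(bricks))
    opts["version"] = opts["version"] + 1
    return opts

def reconcile(local, peer):
    """On handshake, adopt the peer's copy only if its version is higher;
    otherwise keep the local copy."""
    return peer if peer["version"] > local["version"] else local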


I did a POC with the second approach, storing the list of active bricks as
'NodeUUID:brickpath' entries in the options file itself. It seems to work fine,
except for a bug in glusterd where the daemons are spawned before the node gets
the 'options' file from the other nodes during handshake.

CHANGES IN GLUSTERD:
    When a node goes down, all the other nodes are notified through
  glusterd_peer_rpc_notify, which needs to find the replicas of the node that
  went down and update the global file.
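The bookkeeping on peer-down might look like this (glusterd itself is C; this Python sketch only illustrates the logic, and replicas_of is a hypothetical lookup into the volume's replica info):

```python
def on_peer_down(active, down_node, replicas_of):
    """Replace the down node's bricks with their replicas, so the global
    file always lists every brick that should be ACTIVE."""
    new_active = set(active)
    for brick in list(new_active):
        node, _path = brick.split(":", 1)
        if node == down_node:
            new_active.discard(brick)
            # replicas_of maps a brick to its replica pair's brick.
            new_active.add(replicas_of[brick])
    return new_active
```

Per the SOLUTION above, there would be no symmetric swap-back on peer-up: the replica stays in the file until it goes down itself.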

PROBLEMS/LIMITATIONS WITH THIS APPROACH:
    1. If glusterd is killed but the node is still up, the other replica is
       made 'ACTIVE'. So both replica bricks will be syncing at that point in
       time, which is not expected.

    2. If a single brick process is killed, its replica brick is not made
       'ACTIVE'.


Glusterd/AFR folks,

    1. Do you see a better approach than the above to solve this issue?
    2. Is this approach feasible? If yes, how can I handle the problems
       mentioned above?
    3. Is this approach feasible from a scalability point of view, since the
       complete list of active brick paths is stored and read by gsyncd?
    4. Does this approach fit into three-way replication and erasure coding?



Thanks and Regards,
Kotresh H R 
_______________________________________________
Gluster-devel mailing list
[email protected]
http://www.gluster.org/mailman/listinfo/gluster-devel
