On 11/05/14 12:23, Ravishankar N wrote:
> On 11/05/2014 03:18 PM, Andreas Hollaus wrote:
>> Hi,
>>
>> I'm curious about the 5-phase transaction scheme that is described in the
>> document (lock, pre-op, op, post-op, unlock). Are these stage switches all
>> triggered from the client, or can the server do it without notifying the
>> client, for instance switching from 'op' to 'post-op'?
>
> All stages are performed by the AFR translator in the client graph, where it
> is loaded, in the sequence you listed.

So the counters are stored on the servers (as extended attributes on the
bricks), but increased and decreased by the client after fetching them from
the servers? If so, I guess that the messages between them are just
synchronous file system operations like reading extended attributes, writing
files, etc.
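To make the counter bookkeeping concrete, here is a purely illustrative Python sketch of the accounting, not GlusterFS code: the names `Brick` and `afr_write` are mine, locks and the write itself are elided, and the real xattrs are bit fields split into data/metadata/entry parts rather than a single integer.

```python
# Illustrative sketch of AFR's changelog accounting (NOT GlusterFS code).
# Each brick stores, per file, a pending-operation counter for every peer
# brick (in reality a trusted.afr.* extended attribute). The client drives
# all five phases; servers only answer xattr/write requests.

class Brick:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.pending = {}   # peer brick name -> pending-op count for this file

def afr_write(bricks):
    """One transaction: lock, pre-op, op, post-op, unlock (locks elided)."""
    up = [b for b in bricks if b.up]
    # pre-op: every reachable brick records a pending op against every peer
    for b in up:
        for peer in bricks:
            if peer is not b:
                b.pending[peer.name] = b.pending.get(peer.name, 0) + 1
    # op: the actual write would happen here
    # post-op: clear the accusation only for peers the write actually reached
    for b in up:
        for peer in up:
            if peer is not b:
                b.pending[peer.name] -= 1
```

After a clean transaction all counters return to zero; if a brick is down, the surviving brick's counter for it stays non-zero ("accuses" it), which is what self-heal later consults.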
Is the client created whenever a GlusterFS volume is mounted? As I'm running
both server and client on the same board, it's a bit hard to distinguish them
from each other.

>> Decreasing the counter for the local pending operations could be done
>> without talking to the client, even though I realize a message has to be
>> sent to the other server(s), possibly through the client.
>>
>> The reason I ask is that I'm trying to estimate the risk of ending up in a
>> split-brain situation, or at least understand whether our servers will
>> 'accuse' each other temporarily during this 5-phase transaction under
>> normal circumstances. If I understand who sends messages to whom, and in
>> what order, I'll have a better chance of seeing whether we need any
>> solution to split-brain situations. As I've had problems setting up the
>> 'favorite-child' option, I want to know whether it's required or not. In
>> our use case, quorum is not a solution, but losing some data is acceptable
>> as long as the bricks stay in sync.
>
> If a file is split-brained, AFR does not allow modifications by clients on
> it until the split-brain is resolved. The afr xattrs and heal mechanisms
> ensure that the bricks are in sync, so no worries on that front.

I know about the input/output error in case of a split-brain, and that is
something we must avoid at any cost. That's the reason why 'favorite-child'
seems like a good idea for us, but my filter script is not executed even
though I tried a couple of probable locations to store it in. It's a bit hard
to be absolutely sure what that filter path macro contained at the time the
GlusterFS package was built. It would have been easier if the path existed,
even if it were empty when no filters were used. According to the source
code, there are some error return paths that could also explain why the
filter script is not run. Is there any way to raise the verbosity level to
get some more clues about what's going on?
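On the verbosity question: log levels can be raised per volume with the diagnostics options. A sketch, with `gv0` as a placeholder volume name (option names and log paths may differ between releases, so check your version's docs):

```shell
# Raise client- and brick-side log verbosity for volume 'gv0' (placeholder).
gluster volume set gv0 diagnostics.client-log-level DEBUG
gluster volume set gv0 diagnostics.brick-log-level DEBUG

# Client (mount) logs are typically written under /var/log/glusterfs/,
# named after the mount point; brick logs under /var/log/glusterfs/bricks/.
```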
Regards
Andreas

> Thanks,
> Ravi
>>
>> Regards
>> Andreas
>>
>> On 10/31/14 15:37, Ravishankar N wrote:
>>> On 10/30/2014 07:23 PM, Andreas Hollaus wrote:
>>>> Hi,
>>>>
>>>> Thanks! Seems like an interesting document. Although I've read blogs
>>>> about how extended attributes are used as a change log, this seems like a
>>>> more comprehensive document.
>>>>
>>>> I won't write directly to any brick. That's the reason I first have to
>>>> create a volume which consists of only one brick, until the other server
>>>> is available, and then add that second brick. I don't want to delay the
>>>> file system clients until the second server is available, hence the
>>>> reason for add-brick.
>>>>
>>>> I guess that this procedure is only needed the first time the volume is
>>>> configured, right? If any of these bricks failed later on, the change log
>>>> would keep track of all changes to the file system even though only one
>>>> of the bricks was available(?).
>>> Yes, if one brick of a replica pair goes down, the other one keeps track
>>> of file modifications by the client, and would sync them back to the first
>>> one when it comes back up.
>>>
>>>> After a restart, volume settings stored in the configuration file would
>>>> be accepted even though not all servers were up and running yet at that
>>>> time, wouldn't they?
>>> glusterd running on all nodes ensures that the volume configurations
>>> stored on each node are in sync.
>>>> Speaking about configuration files: when are these copied to each server?
>>>> If I create a volume which consists of two bricks, I guess that those
>>>> servers will create the configuration files, independently of each other,
>>>> from the information sent from the client (gluster volume create...).
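The create-then-grow sequence discussed above can be sketched with the standard CLI; the host names, the volume name `gv0`, and the brick paths below are placeholders, not values from this thread:

```shell
# Start with a single-brick volume clients can use right away.
gluster volume create gv0 server1:/export/brick1
gluster volume start gv0

# ...later, once the second server is reachable:
gluster peer probe server2
gluster volume add-brick gv0 replica 2 server2:/export/brick1

# Sync the existing files onto the newly added (empty) replica brick.
gluster volume heal gv0 full
```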
>>> All volume config/management commands must be run from any of the servers
>>> that make up the volume and not the client (unless both happen to be on
>>> the same machine). As mentioned above, when any of the volume commands is
>>> run on any one server, glusterd orchestrates the necessary action on all
>>> servers and keeps them in sync.
>>>> In case I later on add a brick, I guess that the settings have to be
>>>> copied to the new brick after they have been modified on the first one,
>>>> right (or will they be recreated on all servers from the information
>>>> specified by the client, like in the previous case)?
>>>>
>>>> Will configuration files be copied in other situations as well, for
>>>> instance in case one of the servers which is part of the volume for some
>>>> reason is missing those files? In my case, the root file system is
>>>> recreated from an image at each reboot, so everything created in /etc
>>>> will be lost. Will GlusterFS settings be restored from the other server
>>>> automatically
>>> No, it is expected that servers have persistent file systems. There are
>>> ways to restore such bricks; see
>>> http://gluster.org/community/documentation/index.php/Gluster_3.4:_Brick_Restoration_-_Replace_Crashed_Server
>>>
>>> -Ravi
>>>> or do I need to back up and restore those myself? Even though the brick
>>>> doesn't know that it is part of a volume in case it loses the
>>>> configuration files, both the other server(s) and the client(s) will
>>>> probably recognize it as being part of the volume. I therefore believe
>>>> that such self-healing would actually be possible, even though it may not
>>>> be implemented.
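For reference, the restoration procedure on the linked page boils down to re-stamping the replacement brick with the volume's id so glusterd accepts it again. A rough sketch only (paths and the volume name `gv0` are placeholders; follow the linked document for the complete, version-specific steps):

```shell
# On a healthy server: read the volume id stamped on its brick.
getfattr -n trusted.glusterfs.volume-id -e hex /export/brick1

# On the restored server: recreate the brick directory, stamp the same id,
# then restart glusterd and trigger a heal so data is synced back.
setfattr -n trusted.glusterfs.volume-id -v 0x<id-from-above> /export/brick1
service glusterd restart
gluster volume heal gv0 full
```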
>>>>
>>>>
>>>> Regards
>>>> Andreas
>>>> On 10/30/14 05:21, Ravishankar N wrote:
>>>>> On 10/28/2014 03:58 PM, Andreas Hollaus wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I'm curious about how GlusterFS manages to sync the bricks in the
>>>>>> initial phase, when the volume is created or extended.
>>>>>>
>>>>>> I first create a volume consisting of only one brick, which clients
>>>>>> will start to read and write. After a while I add a second brick to the
>>>>>> volume to create a replicated volume.
>>>>>>
>>>>>> If this new brick is empty, I guess that files will be copied from the
>>>>>> first brick to get the bricks in sync, right?
>>>>>>
>>>>>> However, if the second brick is not empty but rather contains a subset
>>>>>> of the files on the first brick, I don't see how GlusterFS will solve
>>>>>> the problem of syncing the bricks.
>>>>>>
>>>>>> I guess that all files which lack extended attributes could be removed
>>>>>> in this scenario, because they were created when the disk was not part
>>>>>> of a GlusterFS volume. However, in case the brick was used in the
>>>>>> volume previously, for instance before that server restarted, there
>>>>>> will be extended attributes for the files on the second brick which
>>>>>> weren't updated during the downtime (when the volume consisted of only
>>>>>> one brick). There could be multiple changes to the files during this
>>>>>> time. In this case I don't understand how the extended attributes could
>>>>>> be used to determine which of the bricks contains the most recent file.
>>>>>>
>>>>>> Can anyone explain how this works? Is it only allowed to add empty
>>>>>> bricks to a volume?
>>>>>>
>>>>>>
>>>>> It is allowed to add only empty bricks to the volume. Writing directly
>>>>> to bricks is not supported. One needs to access the volume only from a
>>>>> mount point or using libgfapi.
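The afr extended attributes discussed in this thread can be inspected directly on a brick (read-only). A sketch; the brick path is an example, and the exact `trusted.afr.*` names depend on the volume and client naming:

```shell
# Dump all extended attributes of a file as stored on the brick.
getfattr -d -m . -e hex /export/brick1/somefile

# Typical output includes one changelog xattr per replica peer, e.g.
#   trusted.afr.gv0-client-0=0x000000000000000000000000
#   trusted.afr.gv0-client-1=0x000000000000000000000000
# All-zero values mean no pending operations; a non-zero counter means this
# brick 'accuses' that peer of having missed writes, and self-heal uses it
# to pick the source copy.
```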
>>>>> After adding a brick to increase the distribute count, you need to run
>>>>> the volume rebalance command so that some of the existing files are
>>>>> hashed (moved) to the newly added brick.
>>>>> After adding a brick to increase the replica count, you need to run the
>>>>> volume heal full command to sync the files from the other replica onto
>>>>> the newly added brick.
>>>>> https://github.com/gluster/glusterfs/blob/master/doc/features/afr-v1.md
>>>>> will give you an idea of how the replicate translator uses xattrs to
>>>>> keep files in sync.
>>>>>
>>>>> HTH,
>>>>> Ravi
>>
_______________________________________________
Gluster-users mailing list
[email protected]
http://supercolony.gluster.org/mailman/listinfo/gluster-users
