On Fri, Mar 16, 2018 at 4:57 AM, Victor T <[email protected]> wrote:
> Xavi, does that mean that even if every node was rebooted one at a time,
> even without issuing a heal, that the volume would have no issues after
> running gluster volume heal [volname] when all bricks are back online?

No. After bringing up one brick and before stopping the next one, you need
to be sure that there are no damaged files. You shouldn't reboot a node if
"gluster volume heal <volname> info" shows damaged files. The command
"gluster volume heal <volname>" is only a tool to force heal to progress
(until the bug is fixed).

Xavi

> ------------------------------
> *From:* Xavi Hernandez <[email protected]>
> *Sent:* Thursday, March 15, 2018 12:09:05 AM
> *To:* Victor T
> *Cc:* [email protected]
> *Subject:* Re: [Gluster-users] Disperse volume recovery and healing
>
> Hi Victor,
>
> On Wed, Mar 14, 2018 at 12:30 AM, Victor T <[email protected]>
> wrote:
>
> I have a question about how disperse volumes handle brick failure. I'm
> running version 3.10.10 on all systems. If I have a disperse volume in a
> 4+2 configuration with 6 servers each serving 1 brick, and maintenance
> needs to be performed on all systems, are there any general steps that
> need to be taken to ensure data is not lost or service interrupted? For
> example, can I just reboot each system sequentially after making sure the
> service is running on all servers before rebooting the next system? Or is
> there a need to force/wait for a heal after each brick comes back online?
> If I have two bricks down for multiple days and then bring them back in,
> is there a need to issue a heal or something like a rebalance before
> rebooting the other servers? There's lots of documentation about other
> volume types, but it seems information specific to dispersed volumes is a
> bit hard to find. Thanks a bunch.
>
> On a 4+2 configuration you could bring down up to 2 bricks simultaneously
> for maintenance. However, if something happens to one of the remaining 4
> bricks, the volume would stop working. So in this case I would recommend
> not having more than one server down for maintenance at the same time,
> unless the downtime is very small.
>
> Once the stopped servers come back up again, you need to wait until all
> files are healed before proceeding with the next server. Failing to do so
> means that some files could have more than 2 non-healthy versions, which
> will make the file inaccessible until enough healthy versions are
> available again.
>
> Self-heal should be automatically triggered once the bricks come online.
> However, there was a bug
> (https://bugzilla.redhat.com/show_bug.cgi?id=1547662) that could cause
> delays in the self-heal process. This bug should be fixed in the next
> version. In the meantime, you can force self-heal to progress by issuing
> "gluster volume heal <volname>" commands each time it seems to have
> stopped.
>
> Once the output of "gluster volume heal <volname> info" reports 0 pending
> files on all bricks, you can proceed with the maintenance of the next
> server.
>
> No need to do any rebalance for down bricks. Rebalance is basically
> needed when the volume is expanded with more bricks.
>
> Xavi
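P.S. For anyone scripting the wait-before-the-next-reboot check described
above, here is a minimal sketch. The volume name "myvol" is a placeholder,
and it assumes "gluster volume heal <volname> info" prints a
"Number of entries: N" line per brick, as 3.x releases typically do:

  #!/bin/bash
  # Wait until heal info reports 0 pending entries on all bricks,
  # nudging self-heal along in the meantime (workaround for the
  # stalled-heal bug mentioned above).
  VOL=myvol   # placeholder: substitute your volume name
  while true; do
      # Sum the "Number of entries" counts across all bricks.
      pending=$(gluster volume heal "$VOL" info \
          | awk '/Number of entries:/ {sum += $NF} END {print sum+0}')
      [ "$pending" -eq 0 ] && break
      # Force heal to progress, then check again in a minute.
      gluster volume heal "$VOL"
      sleep 60
  done
  echo "All bricks report 0 pending entries; safe to service the next node."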
_______________________________________________
Gluster-users mailing list
[email protected]
http://lists.gluster.org/mailman/listinfo/gluster-users
