Thanks for the clarification, one more question. When I recover (boot) the failed node and this peer becomes available again to the remaining two nodes, how do I tell Gluster to mark this brick as failed?
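
(Side note: I assume that once the node is reconnected I can check how the remaining nodes see it with something like

# gluster peer status
# gluster volume status <volname>

and that the brick on node2 should simply show as offline there, since its path no longer exists. Please correct me if that assumption is wrong.)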
I mean, I’ve booted the failed node back without networking. The disk partition (a ZFS pool on other disks) where the brick lived before the failure is lost. Can I start gluster even though I no longer have the ZFS pool where the failed brick used to be? This won't be a problem when I connect this node back to the cluster (before the brick replace/reset command is issued)?

Thanks.

BR!
Martin

> On 11 Apr 2019, at 15:40, Karthik Subrahmanya <[email protected]> wrote:
>
> On Thu, Apr 11, 2019 at 6:38 PM Martin Toth <[email protected]> wrote:
> Hi Karthik,
>
>> On Thu, Apr 11, 2019 at 12:43 PM Martin Toth <[email protected]> wrote:
>> Hi Karthik,
>>
>> Moreover, I would like to ask if there are some recommended settings/parameters for SHD in order to achieve good or fair I/O while the volume is being healed after I replace the brick (this should trigger the healing process).
>> If I understand your concern correctly, you need fair I/O performance for clients while healing takes place as part of the replace-brick operation. For this you can turn off the "data-self-heal" and "metadata-self-heal" options until the heal completes on the new brick.
>
> This is exactly what I mean. I am running VM disks on the remaining 2 (out of 3 - one failed as mentioned) nodes and I need to ensure there will be fair I/O performance available on these two nodes while the replace-brick operation heals the volume.
> I will not run any VMs on the node where the replace-brick operation will be running. So if I understand correctly, when I set:
>
> # gluster volume set <volname> cluster.data-self-heal off
> # gluster volume set <volname> cluster.metadata-self-heal off
>
> this will tell Gluster clients (libgfapi and FUSE mount) not to read from the node where the replace-brick operation is in place but from the remaining two healthy nodes. Is this correct? Thanks for the clarification.
> The reads will be served from one of the good bricks, since the file will either not be present on the replaced brick at the time of the read, or it will be present but marked for heal if it is not already healed. If it has already been healed by SHD, then it could be served from the new brick as well, but there won't be any problem in reading from there in that scenario.
> By setting these two options, whenever a read comes from a client it will not try to heal the file for data/metadata. Otherwise it would try to heal (if not already healed by SHD) when the read comes in, hence slowing down the client.
>
>> Turning off client-side healing doesn't compromise data integrity and consistency. During a read request from a client, the pending xattr is evaluated for the replica copies and the read is only served from the correct copy. During writes, I/O will continue on both replicas and SHD will take care of healing the files.
>> After replacing the brick, we strongly recommend you to consider upgrading your gluster to one of the maintained versions. We have many stability related fixes there, which can handle some critical issues and corner cases you could hit during these kinds of scenarios.
>
> This will be the first priority in the infrastructure after getting this cluster back to a fully functional replica 3. I will upgrade to 3.12.x and then to version 5 or 6.
> Sounds good.
>
> If you are planning to use the same name for the new brick and you get an error like "Brick may be containing or be contained by an existing brick" even after using the force option, try using a different name.
> That should work.
>
> Regards,
> Karthik
>
> BR,
> Martin
>
>> Regards,
>> Karthik
>> I had some problems in the past when healing was triggered: VM disks became unresponsive because healing took most of the I/O. My volume contains only big files with VM disks.
>>
>> Thanks for the suggestions.
>> BR,
>> Martin
>>
>>> On 10 Apr 2019, at 12:38, Martin Toth <[email protected]> wrote:
>>>
>>> Thanks, this looks OK to me. I will reset the brick because I don't have any data left on the failed node, so I can use the same path / brick name.
>>>
>>> Is reset-brick a dangerous command? Should I be worried about some possible failure that will impact the remaining two nodes? I am running a really old (3.7.6) but stable version.
>>>
>>> Thanks,
>>> BR!
>>>
>>> Martin
>>>
>>>> On 10 Apr 2019, at 12:20, Karthik Subrahmanya <[email protected]> wrote:
>>>>
>>>> Hi Martin,
>>>>
>>>> After you add the new disks and create the raid array, you can run the following command to replace the old brick with the new one:
>>>>
>>>> - If you are going to use a different name for the new brick you can run
>>>> gluster volume replace-brick <volname> <old-brick> <new-brick> commit force
>>>>
>>>> - If you are planning to use the same name for the new brick as well, then you can use
>>>> gluster volume reset-brick <volname> <old-brick> <new-brick> commit force
>>>> Here the old brick's and new brick's hostname & path should be the same.
>>>>
>>>> After replacing the brick, make sure the brick comes online using volume status.
>>>> Heal should start automatically; you can check the heal status to see all the files get replicated to the newly added brick. If it does not start automatically, you can start it manually by running gluster volume heal <volname>.
>>>>
>>>> HTH,
>>>> Karthik
>>>>
>>>> On Wed, Apr 10, 2019 at 3:13 PM Martin Toth <[email protected]> wrote:
>>>> Hi all,
>>>>
>>>> I am running a replica 3 gluster volume with 3 bricks. One of my servers failed - all disks are showing errors and the raid is in a fault state.
>>>>
>>>> Type: Replicate
>>>> Volume ID: 41d5c283-3a74-4af8-a55d-924447bfa59a
>>>> Status: Started
>>>> Number of Bricks: 1 x 3 = 3
>>>> Transport-type: tcp
>>>> Bricks:
>>>> Brick1: node1.san:/tank/gluster/gv0imagestore/brick1
>>>> Brick2: node2.san:/tank/gluster/gv0imagestore/brick1 <— this brick is down
>>>> Brick3: node3.san:/tank/gluster/gv0imagestore/brick1
>>>>
>>>> So one of my bricks has totally failed (node2). It went down and all data is lost (failed raid on node2). Now I am running only two bricks on 2 servers out of 3.
>>>> This is a really critical problem for us; we could lose all data. I want to add new disks to node2, create a new raid array on them and try to replace the failed brick on this node.
>>>>
>>>> What is the procedure for replacing Brick2 on node2, can someone advise? I can’t find anything relevant in the documentation.
>>>>
>>>> Thanks in advance,
>>>> Martin
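
PS: to put the steps from the thread above in one place, I assume the whole sequence will look roughly like this (a sketch only; <volname>, <old-brick> and <new-brick> are the same placeholders Karthik used, and the two self-heal options are meant to be switched back on once the heal finishes):

# gluster volume set <volname> cluster.data-self-heal off
# gluster volume set <volname> cluster.metadata-self-heal off

(replace the failed brick: reset-brick if the hostname and path stay the same, replace-brick if the new brick gets a different name)
# gluster volume reset-brick <volname> <old-brick> <new-brick> commit force
# gluster volume replace-brick <volname> <old-brick> <new-brick> commit force

(check that the new brick is online and watch the heal progress; trigger it manually with "gluster volume heal <volname>" if it does not start on its own)
# gluster volume status <volname>
# gluster volume heal <volname> info

(once heal info shows no pending entries)
# gluster volume set <volname> cluster.data-self-heal on
# gluster volume set <volname> cluster.metadata-self-heal on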
