Just for other users who may find this useful.

I finally started the Gluster server process on the failed node that lost its brick, and everything 
went OK.
The server is available as a peer again and the failed brick is not running, so I can 
continue with the replace-brick / reset-brick operation.
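In case it helps someone else, this is roughly how I verified the state before moving on (<volname> is a placeholder for the actual volume name): the recovered node should be listed as a connected peer again, and the failed brick should show as offline in volume status.

# gluster peer status
# gluster volume status <volname>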

> On 16 Apr 2019, at 17:44, Martin Toth <[email protected]> wrote:
> 
> Thanks for the clarification, one more question.
> 
> When I recover (boot) the failed node and this peer becomes available again to the 
> remaining two nodes, how do I tell Gluster to mark this brick as failed?
> 
> I mean, I’ve booted the failed node back without networking. The disk partition (a ZFS 
> pool on other disks) where the brick lived before the failure is lost.
> Can I start Gluster even though I no longer have the ZFS pool where the failed brick 
> was?
> 
> This won't be a problem when I connect this node back to the cluster (before the 
> replace-brick / reset-brick command is issued), will it?
> 
> Thanks. BR!
> Martin
> 
>> On 11 Apr 2019, at 15:40, Karthik Subrahmanya <[email protected]> wrote:
>> 
>> 
>> 
>> On Thu, Apr 11, 2019 at 6:38 PM Martin Toth <[email protected]> wrote:
>> Hi Karthik,
>> 
>>> On Thu, Apr 11, 2019 at 12:43 PM Martin Toth <[email protected]> wrote:
>>> Hi Karthik,
>>> 
>>> Moreover, I would like to ask if there are any recommended 
>>> settings/parameters for SHD in order to achieve good or fair I/O while the 
>>> volume is being healed after I replace the brick (this should trigger the 
>>> healing process).
>>> If I understand your concern correctly, you need fair I/O performance 
>>> for clients while healing takes place as part of the replace-brick 
>>> operation. For this you can turn off the "data-self-heal" and 
>>> "metadata-self-heal" options until the heal completes on the new brick.
>> 
>> This is exactly what I mean. I am running VM disks on the remaining 2 nodes (out of 3, 
>> one failed as mentioned) and I need to ensure fair I/O performance on these two nodes 
>> while the replace-brick operation heals the volume.
>> I will not run any VMs on the node where the replace-brick operation will be 
>> running. So if I understand correctly, when I set:
>> 
>> # gluster volume set <volname> cluster.data-self-heal off
>> # gluster volume set <volname> cluster.metadata-self-heal off
>> 
>> this will tell Gluster clients (libgfapi and FUSE mounts) not to read from the 
>> node where the replace-brick operation is in place, but from the remaining two 
>> healthy nodes. Is this correct? Thanks for the clarification.
>> The reads will be served from one of the good bricks, since at the time of the read 
>> the file will either not be present on the replaced brick, or it will be present but 
>> marked for heal if it has not already been healed. If it has already been healed by 
>> SHD, then it could be served from the new brick as well, and there is no problem 
>> reading from there in that scenario.
>> By setting these two options, whenever a read comes from a client it will not 
>> try to heal the file's data/metadata. Otherwise it would try to heal (if 
>> not already healed by SHD) when the read comes in, slowing down the client.
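>> Once gluster volume heal <volname> info shows zero entries pending on all the bricks, 
>> i.e. the heal on the new brick has completed, you can simply turn these options back on:
>> 
>> # gluster volume set <volname> cluster.data-self-heal on
>> # gluster volume set <volname> cluster.metadata-self-heal on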
>> 
>>> Turning off client-side healing doesn't compromise data integrity or 
>>> consistency. During a read request from the client, the pending xattrs are 
>>> evaluated for the replica copies and the read is served only from a correct copy. 
>>> During writes, I/O will continue on both replicas and SHD will take care of 
>>> healing the files.
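>>> If you are curious, you can see those pending markers directly on the bricks with 
>>> getfattr (the file name here is just an example); the trusted.afr.* entries are what 
>>> AFR uses to decide which copy is good:
>>> 
>>> # getfattr -d -m . -e hex /tank/gluster/gv0imagestore/brick1/some-vm-disk.qcow2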
>>> After replacing the brick, we strongly recommend that you consider upgrading 
>>> your Gluster to one of the maintained versions. There are many stability-related 
>>> fixes there, which handle critical issues and corner cases that you could hit 
>>> during these kinds of scenarios.
>> 
>> This will be our first priority in the infrastructure after getting this cluster back 
>> to a fully functional replica 3. I will upgrade to 3.12.x and then to version 5 
>> or 6.
>> Sounds good.
>> 
>> If you are planning to use the same name for the new brick and you get an 
>> error like "Brick may be containing or be contained by an existing 
>> brick" even after using the force option, try using a different name. That 
>> should work.
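>> For example, something along these lines, where brick1_new is just an illustrative 
>> new directory on the rebuilt array:
>> 
>> # gluster volume replace-brick <volname> node2.san:/tank/gluster/gv0imagestore/brick1 node2.san:/tank/gluster/gv0imagestore/brick1_new commit force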
>> 
>> Regards,
>> Karthik 
>> 
>> BR, 
>> Martin
>> 
>>> Regards,
>>> Karthik
>>> I had some problems in the past when healing was triggered: VM disks became 
>>> unresponsive because healing took most of the I/O. My volume contains only 
>>> big files with VM disks.
>>> 
>>> Thanks for suggestions.
>>> BR, 
>>> Martin
>>> 
>>>> On 10 Apr 2019, at 12:38, Martin Toth <[email protected]> wrote:
>>>> 
>>>> Thanks, this looks OK to me. I will reset the brick because I don't have any 
>>>> data left on the failed node, so I can use the same path / brick name.
>>>> 
>>>> Is reset-brick a dangerous command? Should I be worried about a 
>>>> possible failure that would impact the remaining two nodes? I am running a really 
>>>> old but stable version, 3.7.6.
>>>> 
>>>> Thanks,
>>>> BR!
>>>> 
>>>> Martin
>>>>  
>>>> 
>>>>> On 10 Apr 2019, at 12:20, Karthik Subrahmanya <[email protected]> wrote:
>>>>> 
>>>>> Hi Martin,
>>>>> 
>>>>> After you add the new disks and create the RAID array, you can run one of the 
>>>>> following commands to replace the old brick with the new one:
>>>>> 
>>>>> - If you are going to use a different name for the new brick, you can run
>>>>> gluster volume replace-brick <volname> <old-brick> <new-brick> commit 
>>>>> force
>>>>> 
>>>>> - If you are planning to use the same name for the new brick as well, then 
>>>>> you can use
>>>>> gluster volume reset-brick <volname> <old-brick> <new-brick> commit force
>>>>> Here the old-brick and new-brick hostname & path should be the same.
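>>>>> With your layout that would look something like this (with your actual volume 
>>>>> name in place of <volname>):
>>>>> 
>>>>> gluster volume reset-brick <volname> node2.san:/tank/gluster/gv0imagestore/brick1 node2.san:/tank/gluster/gv0imagestore/brick1 commit force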
>>>>> 
>>>>> After replacing the brick, make sure the brick comes online using volume 
>>>>> status.
>>>>> Heal should start automatically; you can check the heal status to see that all 
>>>>> the files get replicated to the newly added brick. If it does not start 
>>>>> automatically, you can start it manually by running gluster volume heal 
>>>>> <volname>.
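>>>>> For example, to verify the brick is online, kick off the heal if needed and 
>>>>> watch its progress (again with <volname> as a placeholder):
>>>>> 
>>>>> gluster volume status <volname>
>>>>> gluster volume heal <volname>
>>>>> gluster volume heal <volname> info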
>>>>> 
>>>>> HTH,
>>>>> Karthik
>>>>> 
>>>>> On Wed, Apr 10, 2019 at 3:13 PM Martin Toth <[email protected]> wrote:
>>>>> Hi all,
>>>>> 
>>>>> I am running a replica 3 Gluster volume with 3 bricks. One of my servers failed - 
>>>>> all disks are showing errors and the RAID is in a fault state.
>>>>> 
>>>>> Type: Replicate
>>>>> Volume ID: 41d5c283-3a74-4af8-a55d-924447bfa59a
>>>>> Status: Started
>>>>> Number of Bricks: 1 x 3 = 3
>>>>> Transport-type: tcp
>>>>> Bricks:
>>>>> Brick1: node1.san:/tank/gluster/gv0imagestore/brick1
>>>>> Brick2: node2.san:/tank/gluster/gv0imagestore/brick1 <— this brick is down
>>>>> Brick3: node3.san:/tank/gluster/gv0imagestore/brick1
>>>>> 
>>>>> So one of my bricks has totally failed (node2). It went down and all its data 
>>>>> is lost (failed RAID on node2). Now I am running only two bricks on 2 
>>>>> servers out of 3.
>>>>> This is a really critical problem for us; we could lose all our data. I want to 
>>>>> add new disks to node2, create a new RAID array on them and try to replace 
>>>>> the failed brick on this node.
>>>>> 
>>>>> What is the procedure for replacing Brick2 on node2? Can someone advise? I 
>>>>> can’t find anything relevant in the documentation.
>>>>> 
>>>>> Thanks in advance,
>>>>> Martin

_______________________________________________
Gluster-users mailing list
[email protected]
https://lists.gluster.org/mailman/listinfo/gluster-users
