Hello, 

> All volfiles are autogenerated based on the info available in the other files
> in /var/lib/glusterd/vols/<name>/ (like ./info, ./bricks/*). So to manually 
> fix
> your "situation", please make sure the contents in the files ./info,
> ./node_state.info ./rbstate ./bricks/* are "proper" (you can either share
> them with me offline, or compare them with another volume which is
I will consult with management about sharing them. However, I have found no obvious 
differences in the .vol files: bricks on the faulty servers are defined in them 
exactly the same way as the healthy bricks. 

> good), and issue a "gluster volume reset <volname>" to re-write fresh
> volfiles.
Is this really the correct command? According to the built-in help it only resets 
reconfigured volume options, not the volfiles: 
volume reset <VOLNAME> [option] [force] - reset all the reconfigured options
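If the intention is to regenerate the volfiles from a peer that still holds a good 
copy, "gluster volume sync" may be closer to what was meant; a hedged sketch (the 
choice of peer 00036 is my assumption): 

```shell
# Hypothetical: pull the volume definition for glustervmstore from peer 00036,
# whose /var/lib/glusterd contents are believed to be good
gluster volume sync 00036 glustervmstore
```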

> It is also a good idea to double check the contents of
> /var/lib/glusterd/peers/* is proper too.
Only the dead 00022 server is missing, since peer probe cannot reach it. 00031 
was reattached (but is still not recognized as part of the volume).

> Doing these manual steps and restarting all processes should recover you
> from pretty much any situation.
Yeah, I thought so. However, after any attempt to edit those files the volume 
refused to start (glusterfs-server failed to start), complaining about unknown 
keys and listing brick numbers from the info file. 

 
> Back to the cause of the problem - it appears to be the case that the ongoing
> replace-brick got messed up when yet another server died.
I believe it got messed up (or finally got messed up) when I desperately 
attempted to remove the bricks and issued 
gluster peer detach 00022 force
gluster peer detach 00031 force
hoping that this would break the migration in progress and then let me 
remove/replace those servers. 
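In hindsight, the supported way to break an in-flight migration seems to be 
aborting the replace-brick operation itself rather than detaching peers; a sketch 
using the brick paths from my setup (exact paths assumed): 

```shell
# Abort the stuck brick migration instead of detaching the peers
gluster volume replace-brick glustervmstore 00031:/mnt/vmstore/brick \
    00028:/mnt/vmstore/brick abort
```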

> A different way of achieving what you want, is to use add-brick + remove-
> brick for decommissioning servers (i.e, add-brick the new server
> - 00028, and "remove-brick start" the old one - 00031, and "remove-brick
> commit" once all the data has drained out). Moving forward this will be the
> recommended way to decommission servers. Use replace-brick to only
> replace an already dead server - 00022 with its replacement).
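As I understand the suggested decommission flow, it would look roughly like this 
(volume name and brick paths taken from my setup; the exact sequence is my 
reading of the advice): 

```shell
# Add the new server, drain the old one, then commit once data has migrated
gluster volume add-brick glustervmstore 00028:/mnt/vmstore/brick
gluster volume remove-brick glustervmstore 00031:/mnt/vmstore/brick start
gluster volume remove-brick glustervmstore 00031:/mnt/vmstore/brick status
gluster volume remove-brick glustervmstore 00031:/mnt/vmstore/brick commit
```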

I am using a distributed-replicated volume, so I can only add/remove servers in 
replica pairs. Also, "gluster volume remove-brick" has issues with open files, at 
least with the semiosis package. The COW files (base disk + diff) of any active 
KVM instance get corrupted as soon as either of the following commands touches 
the data:
remove-brick 
rebalance
What I have observed is that the corruption occurs even for the base disk file, 
which in OpenStack is shared by a number of instances, so a single corrupted file 
will cause faults on multiple VMs. Recovery can be impossible, because the 
instances will have written corrupt data to their diff files, and replacing the 
base file with a proper one will not help them. I have tested this several times 
and found that replace-brick, for some reason, works properly and does not cause 
issues with open files. 
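For reference, the replace-brick sequence I mean is roughly the following (brick 
paths as in my setup; flags reproduced from memory, so treat as a sketch): 

```shell
# Migrate data from the old brick to the spare, then commit when finished
gluster volume replace-brick glustervmstore 00031:/mnt/vmstore/brick \
    00028:/mnt/vmstore/brick start
gluster volume replace-brick glustervmstore 00031:/mnt/vmstore/brick \
    00028:/mnt/vmstore/brick status
gluster volume replace-brick glustervmstore 00031:/mnt/vmstore/brick \
    00028:/mnt/vmstore/brick commit
```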

> 
> > I am using Semiosis 3.3.1 package on Ubuntu 12.04:
> > dpkg -l | grep gluster
> > rc  glusterfs                        3.3.0-1
> >         clustered file-system
> > ii  glusterfs-client                 3.3.1-ubuntu1~precise8
> >          clustered file-system (client package)
> > ii  glusterfs-common                 3.3.1-ubuntu1~precise8
> >          GlusterFS common libraries and translator modules
> > ii  glusterfs-server                 3.3.1-ubuntu1~precise8
> >          clustered file-system (server package)

I ran glusterfs-server in debug mode and saw the following when I attempted to 
replace a brick with force. It seems that Gluster is unable to apply a volume 
change if one of the nodes does not respond, even when force is given. I would 
expect "force" to ignore such issues, especially when the change does not involve 
the replica set of the unresponsive node. 

However, in the end I receive the following error, which does not seem to match 
the log:
brick: 00031:/mnt/vmstore/brick does not exist in volume: glustervmstore

In my case the replica pairs are:
00031 -- 00036 (I am replacing 00031 with the spare 00028)
00022 -- 00024 (00022 had a disk failure; the system is offline and unable to 
respond)
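For the dead 00022, my understanding of the suggested replacement is the 
one-step commit form, since the source brick can never answer (NEWHOST is a 
placeholder for whatever spare would take its place): 

```shell
# Replace a brick whose server is permanently gone; no data migration is possible,
# so commit force skips the start/status phase (NEWHOST is hypothetical)
gluster volume replace-brick glustervmstore 00022:/mnt/vmstore/brick \
    NEWHOST:/mnt/vmstore/brick commit force
```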

[2013-06-19 09:56:21.520991] D [glusterd-utils.c:941:glusterd_volinfo_find] 0-: Volume glustervmstore found
[2013-06-19 09:56:21.521014] D [glusterd-utils.c:949:glusterd_volinfo_find] 0-: Returning 0
[2013-06-19 09:56:21.521060] D [glusterd-utils.c:727:glusterd_brickinfo_new] 0-: Returning 0
[2013-06-19 09:56:21.521095] D [glusterd-utils.c:783:glusterd_brickinfo_from_brick] 0-: Returning 0
[2013-06-19 09:56:21.521126] D [glusterd-utils.c:585:glusterd_volinfo_new] 0-: Returning 0
[2013-06-19 09:56:21.521170] D [glusterd-utils.c:672:glusterd_volume_brickinfos_delete] 0-: Returning 0
[2013-06-19 09:56:21.521201] D [glusterd-utils.c:701:glusterd_volinfo_delete] 0-: Returning 0
[2013-06-19 09:56:21.521233] D [glusterd-utils.c:727:glusterd_brickinfo_new] 0-: Returning 0
[2013-06-19 09:56:21.521261] D [glusterd-utils.c:783:glusterd_brickinfo_from_brick] 0-: Returning 0
[2013-06-19 09:56:21.521290] D [glusterd-utils.c:585:glusterd_volinfo_new] 0-: Returning 0
[2013-06-19 09:56:21.521322] D [glusterd-utils.c:672:glusterd_volume_brickinfos_delete] 0-: Returning 0
[2013-06-19 09:56:21.521350] D [glusterd-utils.c:701:glusterd_volinfo_delete] 0-: Returning 0
[2013-06-19 09:56:21.521385] D [glusterd-utils.c:4344:glusterd_is_rb_started] 0-: is_rb_started:status=0
[2013-06-19 09:56:21.521417] I [glusterd-utils.c:857:glusterd_volume_brickinfo_get_by_brick] 0-: brick: 00031:/mnt/vmstore/brick
[2013-06-19 09:56:21.521457] D [glusterd-utils.c:4115:glusterd_friend_find_by_hostname] 0-management: Friend 0031 found.. state: 3
[2013-06-19 09:56:21.521485] D [glusterd-utils.c:4198:glusterd_hostname_to_uuid] 0-: returning 0

[2013-06-19 09:56:21.524381] D [glusterd-utils.c:4164:glusterd_friend_find_by_hostname] 0-management: Unable to find friend: 00022

[2013-06-19 09:56:21.525286] D [glusterd-utils.c:234:glusterd_is_local_addr] 0-management: 10.x.x.x
[2013-06-19 09:56:21.525346] D [glusterd-utils.c:234:glusterd_is_local_addr] 0-management: 10.x.x.x
[2013-06-19 09:56:21.525371] D [glusterd-utils.c:234:glusterd_is_local_addr] 0-management: 10.x.x.x

[2013-06-19 09:56:21.525392] D [glusterd-utils.c:255:glusterd_is_local_addr] 0-management: 00022 is not local

[2013-06-19 09:56:21.525407] D [glusterd-utils.c:4198:glusterd_hostname_to_uuid] 0-: returning 1
[2013-06-19 09:56:21.525421] D [glusterd-utils.c:739:glusterd_resolve_brick] 0-: Returning 1
[2013-06-19 09:56:21.525434] D [glusterd-utils.c:838:glusterd_volume_brickinfo_get] 0-: Returning -1
[2013-06-19 09:56:21.525447] D [glusterd-utils.c:881:glusterd_volume_brickinfo_get_by_brick] 0-: Returning -1
[2013-06-19 09:56:21.525464] D [glusterd-replace-brick.c:504:glusterd_op_stage_replace_brick] 0-: Returning -1
[2013-06-19 09:56:21.525477] D [glusterd-op-sm.c:2968:glusterd_op_stage_validate] 0-: Returning -1

[2013-06-19 09:56:21.525491] E [glusterd-op-sm.c:1999:glusterd_op_ac_send_stage_op] 0-: Staging failed

[2013-06-19 09:56:21.525507] D [glusterd-op-sm.c:4539:glusterd_op_sm_inject_event] 0-glusterd: Enqueue event: 'GD_OP_EVENT_RCVD_RJT'
[2013-06-19 09:56:21.525521] I [glusterd-op-sm.c:2039:glusterd_op_ac_send_stage_op] 0-glusterd: Sent op req to 0 peers
[2013-06-19 09:56:21.525535] D [glusterd-op-sm.c:4539:glusterd_op_sm_inject_event] 0-glusterd: Enqueue event: 'GD_OP_EVENT_ALL_ACC'
[2013-06-19 09:56:21.525549] D [glusterd-op-sm.c:144:glusterd_op_sm_inject_all_acc] 0-: Returning 0
[2013-06-19 09:56:21.525561] D [glusterd-op-sm.c:2044:glusterd_op_ac_send_stage_op] 0-: Returning with 0
[2013-06-19 09:56:21.525575] D [glusterd-utils.c:4719:glusterd_sm_tr_log_transition_add] 0-glusterd: Transitioning from 'Lock sent' to 'Stage op sent' due to event 'GD_OP_EVENT_ALL_ACC'
_______________________________________________
Gluster-users mailing list
[email protected]
http://supercolony.gluster.org/mailman/listinfo/gluster-users
