Hi Atin,

These are the exact steps I performed that led to the failure. In addition, node3's OS drive was running out of space when the service failed; I cleared some space on the OS drive, but the service still failed to start.
I was trying to simulate a situation where a volume stops abnormally and the entire cluster is restarted with some missing disks. My test cluster is set up with 3 nodes, each with four disks, and I have set up a volume with disperse 4+2. In node-3, 2 disks failed; to replace them I shut down all systems. Below are the steps taken.

1. umount from the client machine
2. shut down all systems by running `shutdown -h now` (without stopping the volume or the service)
3. replace the faulty disks in node-3
4. power ON all systems
5. format the replaced drives and mount all drives
6. start the glusterd service on all nodes (success)
7. run `volume status` from node-3
output: [2019-01-15 16:52:17.718422] : v status : FAILED : Staging failed on 0083ec0c-40bf-472a-a128-458924e56c96. Please check log file for details.
8. run `volume start gfs-tst` from node-3
output: [2019-01-15 16:53:19.410252] : v start gfs-tst : FAILED : Volume gfs-tst already started
9. run `gluster v status` on another node; it shows all listed bricks as available, but the 'self-heal daemon' on node-3 is not running:

@gfstst-node2:~$ sudo gluster v status
Status of volume: gfs-tst
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick IP.2:/media/disk1/brick1              49152     0          Y       1517
Brick IP.4:/media/disk1/brick1              49152     0          Y       1668
Brick IP.2:/media/disk2/brick2              49153     0          Y       1522
Brick IP.4:/media/disk2/brick2              49153     0          Y       1678
Brick IP.2:/media/disk3/brick3              49154     0          Y       1527
Brick IP.4:/media/disk3/brick3              49154     0          Y       1677
Brick IP.2:/media/disk4/brick4              49155     0          Y       1541
Brick IP.4:/media/disk4/brick4              49155     0          Y       1683
Self-heal Daemon on localhost               N/A       N/A        Y       2662
Self-heal Daemon on IP.4                    N/A       N/A        Y       2786

10. since the output above said 'volume already started', ran `reset-brick`:
v reset-brick gfs-tst IP.3:/media/disk3/brick3 IP.3:/media/disk3/brick3 commit force
output: [2019-01-15 16:57:37.916942] : v reset-brick gfs-tst IP.3:/media/disk3/brick3 IP.3:/media/disk3/brick3 commit force : FAILED : /media/disk3/brick3 is already part of a volume
11. since reset-brick was not working, tried stopping the volume and starting it with force
output: [2019-01-15 17:01:04.570794] : v start gfs-tst force : FAILED : Pre-validation failed on localhost. Please check log file for details
12. stopped the service on all nodes and tried starting it again. All nodes except node-3 started successfully; node-3 reports the following:

sudo service glusterd start
 * Starting glusterd service glusterd                              [fail]
/usr/local/sbin/glusterd: option requires an argument -- 'f'
Try `glusterd --help' or `glusterd --usage' for more information.

13. checking the glusterd log file showed that the OS drive had run out of space
output: [2019-01-15 16:51:37.210792] W [MSGID: 101012] [store.c:372:gf_store_save_value] 0-management: fflush failed. [No space left on device]
[2019-01-15 16:51:37.210874] E [MSGID: 106190] [glusterd-store.c:1058:glusterd_volume_exclude_options_write] 0-management: Unable to write volume values for gfs-tst
14. cleared some space on the OS drive, but the service is still not running.
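A check along these lines would confirm the damaged store before restarting glusterd (a sketch only; the paths assume the default /var/lib/glusterd working directory reported in the logs):

  # free space on the partition backing glusterd's configuration store
  df -h /var/lib/glusterd

  # a write that failed with "No space left on device" can leave a
  # truncated or missing store file behind; list the persisted brick
  # files for the volume to spot any that are absent or zero-length
  ls -l /var/lib/glusterd/vols/gfs-tst/bricks/

If a brick file is missing or empty, freeing disk space alone will not repair it.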
Below is the error logged in glusterd.log after clearing space:

[2019-01-15 17:50:13.956053] I [MSGID: 100030] [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p /var/run/glusterd.pid)
[2019-01-15 17:50:13.960131] I [MSGID: 106478] [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors set to 65536
[2019-01-15 17:50:13.960193] I [MSGID: 106479] [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working directory
[2019-01-15 17:50:13.960212] I [MSGID: 106479] [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file working directory
[2019-01-15 17:50:13.964437] W [MSGID: 103071] [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event channel creation failed [No such device]
[2019-01-15 17:50:13.964474] W [MSGID: 103055] [rdma.c:4938:init] 0-rdma.management: Failed to initialize IB Device
[2019-01-15 17:50:13.964491] W [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
[2019-01-15 17:50:13.964560] W [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed
[2019-01-15 17:50:13.964579] E [MSGID: 106244] [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
[2019-01-15 17:50:14.967681] I [MSGID: 106513] [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 40100
[2019-01-15 17:50:14.973931] I [MSGID: 106544] [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: d6bf51a7-c296-492f-8dac-e81efa9dd22d
[2019-01-15 17:50:15.046620] E [MSGID: 101032] [store.c:441:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/vols/gfs-tst/bricks/IP.3:-media-disk3-brick3. [No such file or directory]
[2019-01-15 17:50:15.046685] E [MSGID: 106201] [glusterd-store.c:3384:glusterd_store_retrieve_volumes] 0-management: Unable to restore volume: gfs-tst
[2019-01-15 17:50:15.046718] E [MSGID: 101019] [xlator.c:720:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2019-01-15 17:50:15.046732] E [MSGID: 101066] [graph.c:367:glusterfs_graph_init] 0-management: initializing translator failed
[2019-01-15 17:50:15.046741] E [MSGID: 101176] [graph.c:738:glusterfs_graph_activate] 0-graph: init failed
[2019-01-15 17:50:15.047171] W [glusterfsd.c:1514:cleanup_and_exit] (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: received signum (-1), shutting down

15. on another node, `volume status` still lists the remaining bricks as online:

@gfstst-node2:~$ sudo gluster v status
Status of volume: gfs-tst
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick IP.2:/media/disk1/brick1              49152     0          Y       1517
Brick IP.4:/media/disk1/brick1              49152     0          Y       1668
Brick IP.2:/media/disk2/brick2              49153     0          Y       1522
Brick IP.4:/media/disk2/brick2              49153     0          Y       1678
Brick IP.2:/media/disk3/brick3              49154     0          Y       1527
Brick IP.4:/media/disk3/brick3              49154     0          Y       1677
Brick IP.2:/media/disk4/brick4              49155     0          Y       1541
Brick IP.4:/media/disk4/brick4              49155     0          Y       1683
Self-heal Daemon on localhost               N/A       N/A        Y       2662
Self-heal Daemon on IP.4                    N/A       N/A        Y       2786

Task Status of Volume gfs-tst
------------------------------------------------------------------------------
There are no active volume tasks
16. `peer status` shows node-3 as disconnected:

root@gfstst-node2:~$ sudo gluster pool list
UUID                                    Hostname        State
d6bf51a7-c296-492f-8dac-e81efa9dd22d    IP.3            Disconnected
c1cbb58e-3ceb-4637-9ba3-3d28ef20b143    IP.4            Connected
0083ec0c-40bf-472a-a128-458924e56c96    localhost       Connected

root@gfstst-node2:~$ sudo gluster peer status
Number of Peers: 2

Hostname: IP.3
Uuid: d6bf51a7-c296-492f-8dac-e81efa9dd22d
State: Peer in Cluster (Disconnected)

Hostname: IP.4
Uuid: c1cbb58e-3ceb-4637-9ba3-3d28ef20b143
State: Peer in Cluster (Connected)

regards
Amudhan

On Thu, Jan 31, 2019 at 8:54 AM Atin Mukherjee <amukh...@redhat.com> wrote:

> I'm not very sure how you ended up in a state where one of the nodes lost
> the information of one peer from the cluster. I suspect that while doing a
> replace-node operation you landed in this situation through an incorrect
> step. Unless you can elaborate on all the steps you performed on the
> cluster, it'd be difficult to figure out the exact cause.
>
> On Wed, Jan 30, 2019 at 7:25 PM Amudhan P <amudha...@gmail.com> wrote:
>
>> Hi Atin,
>>
>> yes, it worked out, thank you.
>>
>> what would be the cause of this issue?
>>
>> On Fri, Jan 25, 2019 at 1:56 PM Atin Mukherjee <amukh...@redhat.com>
>> wrote:
>>
>>> Amudhan,
>>>
>>> So here's the issue:
>>>
>>> On node3, 'cat /var/lib/glusterd/peers/*' doesn't show node2's details,
>>> and that's why glusterd wasn't able to resolve the brick(s) hosted on
>>> node2.
>>>
>>> Can you please pick up the 0083ec0c-40bf-472a-a128-458924e56c96 file
>>> from /var/lib/glusterd/peers/ on node 4, place it in the same location
>>> on node 3, and then restart the glusterd service on node 3?
>>>
>>> On Thu, Jan 24, 2019 at 11:57 AM Amudhan P <amudha...@gmail.com> wrote:
>>>
>>>> Atin,
>>>>
>>>> Sorry, I missed sending the entire `glusterd` folder. The attached zip
>>>> now contains the `glusterd` folder from all nodes.
>>>>
>>>> The problem node is node3, IP 10.1.2.3; its `glusterd` log file is
>>>> inside the node3 folder.
>>>>
>>>> regards
>>>> Amudhan
>>>>
>>>> On Wed, Jan 23, 2019 at 11:02 PM Atin Mukherjee <amukh...@redhat.com>
>>>> wrote:
>>>>
>>>>> Amudhan,
>>>>>
>>>>> I see that you have provided the content of the configuration of the
>>>>> volume gfs-tst, whereas the request was to share the dump of
>>>>> /var/lib/glusterd/*. I cannot debug this further until you share the
>>>>> correct dump.
>>>>>
>>>>> On Thu, Jan 17, 2019 at 3:43 PM Atin Mukherjee <amukh...@redhat.com>
>>>>> wrote:
>>>>>
>>>>>> Can you please run 'glusterd -LDEBUG' and share back the
>>>>>> glusterd.log? Instead of doing too much back and forth, I suggest
>>>>>> you share the content of /var/lib/glusterd from all the nodes. Also
>>>>>> do mention which particular node's glusterd service is unable to
>>>>>> come up.
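For reference, the peer-file copy Atin describes above comes down to something like the following sketch; the hostnames are hypothetical, and it assumes root SSH access between the nodes and the default /var/lib/glusterd layout:

  # on node 3: fetch node2's peer file (named after node2's UUID) from
  # node 4, whose /var/lib/glusterd/peers/ directory is still complete
  scp root@node4:/var/lib/glusterd/peers/0083ec0c-40bf-472a-a128-458924e56c96 /var/lib/glusterd/peers/

  # restart glusterd so it re-reads the peer store
  service glusterd restart

  # verify that all peers report as connected again
  gluster peer status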
>>>>>>
>>>>>> On Thu, Jan 17, 2019 at 11:34 AM Amudhan P <amudha...@gmail.com> wrote:
>>>>>>
>>>>>>> I have created the folder in the path as instructed, but the service
>>>>>>> still failed to start. Below is the error msg in glusterd.log:
>>>>>>>
>>>>>>> [2019-01-16 14:50:14.555742] I [MSGID: 100030] [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p /var/run/glusterd.pid)
>>>>>>> [2019-01-16 14:50:14.559835] I [MSGID: 106478] [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors set to 65536
>>>>>>> [2019-01-16 14:50:14.559894] I [MSGID: 106479] [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working directory
>>>>>>> [2019-01-16 14:50:14.559912] I [MSGID: 106479] [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file working directory
>>>>>>> [2019-01-16 14:50:14.563834] W [MSGID: 103071] [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event channel creation failed [No such device]
>>>>>>> [2019-01-16 14:50:14.563867] W [MSGID: 103055] [rdma.c:4938:init] 0-rdma.management: Failed to initialize IB Device
>>>>>>> [2019-01-16 14:50:14.563882] W [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
>>>>>>> [2019-01-16 14:50:14.563957] W [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed
>>>>>>> [2019-01-16 14:50:14.563974] E [MSGID: 106244] [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
>>>>>>> [2019-01-16 14:50:15.565868] I [MSGID: 106513] [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 40100
>>>>>>> [2019-01-16 14:50:15.642532] I [MSGID: 106544] [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: d6bf51a7-c296-492f-8dac-e81efa9dd22d
>>>>>>> [2019-01-16 14:50:15.675333] I [MSGID: 106498] [glusterd-handler.c:3614:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0
>>>>>>> [2019-01-16 14:50:15.675421] W [MSGID: 106061] [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout
>>>>>>> [2019-01-16 14:50:15.675451] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
>>>>>>> *[2019-01-16 14:50:15.676912] E [MSGID: 106187] [glusterd-store.c:4662:glusterd_resolve_all_bricks] 0-glusterd: resolve brick failed in restore*
>>>>>>> *[2019-01-16 14:50:15.676956] E [MSGID: 101019] [xlator.c:720:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again*
>>>>>>> [2019-01-16 14:50:15.676973] E [MSGID: 101066] [graph.c:367:glusterfs_graph_init] 0-management: initializing translator failed
>>>>>>> [2019-01-16 14:50:15.676986] E [MSGID: 101176] [graph.c:738:glusterfs_graph_activate] 0-graph: init failed
>>>>>>> [2019-01-16 14:50:15.677479] W [glusterfsd.c:1514:cleanup_and_exit] (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-:
>>>>>>> received signum (-1), shutting down
>>>>>>>
>>>>>>> On Thu, Jan 17, 2019 at 8:06 AM Atin Mukherjee <amukh...@redhat.com> wrote:
>>>>>>>
>>>>>>>> If gluster volume info/status shows the brick to be
>>>>>>>> /media/disk4/brick4, then you'd need to mount the same path, and
>>>>>>>> hence you'd need to create the brick4 directory explicitly. I fail
>>>>>>>> to understand the rationale of how only /media/disk4 can be used as
>>>>>>>> the mount path for the brick.
>>>>>>>>
>>>>>>>> On Wed, Jan 16, 2019 at 5:24 PM Amudhan P <amudha...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Yes, I did mount the bricks, but the folder 'brick4' was still not
>>>>>>>>> created inside the brick mount. Do I need to create this folder?
>>>>>>>>> When I run replace-brick it creates the folder inside the brick; I
>>>>>>>>> have seen this behavior before when running replace-brick or when
>>>>>>>>> heal begins.
>>>>>>>>>
>>>>>>>>> On Wed, Jan 16, 2019 at 5:05 PM Atin Mukherjee <amukh...@redhat.com> wrote:
>>>>>>>>>
>>>>>>>>>> On Wed, Jan 16, 2019 at 5:02 PM Amudhan P <amudha...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Atin,
>>>>>>>>>>> I have copied the content of 'gfs-tst' from the vols folder on
>>>>>>>>>>> another node. When starting the service it fails again, with this
>>>>>>>>>>> error msg in the glusterd.log file:
>>>>>>>>>>>
>>>>>>>>>>> [2019-01-15 20:16:59.513023] I [MSGID: 100030] [glusterfsd.c:2741:main] 0-/usr/local/sbin/glusterd: Started running /usr/local/sbin/glusterd version 4.1.6 (args: /usr/local/sbin/glusterd -p /var/run/glusterd.pid)
>>>>>>>>>>> [2019-01-15 20:16:59.517164] I [MSGID: 106478] [glusterd.c:1423:init] 0-management: Maximum allowed open file descriptors set to 65536
>>>>>>>>>>> [2019-01-15 20:16:59.517264] I [MSGID: 106479] [glusterd.c:1481:init] 0-management: Using /var/lib/glusterd as working directory
>>>>>>>>>>> [2019-01-15 20:16:59.517283] I [MSGID: 106479] [glusterd.c:1486:init] 0-management: Using /var/run/gluster as pid file working directory
>>>>>>>>>>> [2019-01-15 20:16:59.521508] W [MSGID: 103071] [rdma.c:4629:__gf_rdma_ctx_create] 0-rpc-transport/rdma: rdma_cm event channel creation failed [No such device]
>>>>>>>>>>> [2019-01-15 20:16:59.521544] W [MSGID: 103055] [rdma.c:4938:init] 0-rdma.management: Failed to initialize IB Device
>>>>>>>>>>> [2019-01-15 20:16:59.521562] W [rpc-transport.c:351:rpc_transport_load] 0-rpc-transport: 'rdma' initialization failed
>>>>>>>>>>> [2019-01-15 20:16:59.521629] W [rpcsvc.c:1781:rpcsvc_create_listener] 0-rpc-service: cannot create listener, initing the transport failed
>>>>>>>>>>> [2019-01-15 20:16:59.521648] E [MSGID: 106244] [glusterd.c:1764:init] 0-management: creation of 1 listeners failed, continuing with succeeded transport
>>>>>>>>>>> [2019-01-15 20:17:00.529390] I [MSGID: 106513] [glusterd-store.c:2240:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 40100
>>>>>>>>>>> [2019-01-15 20:17:00.608354] I [MSGID: 106544] [glusterd.c:158:glusterd_uuid_init] 0-management: retrieved UUID: d6bf51a7-c296-492f-8dac-e81efa9dd22d
>>>>>>>>>>> [2019-01-15 20:17:00.650911] W [MSGID: 106425]
>>>>>>>>>>> [glusterd-store.c:2643:glusterd_store_retrieve_bricks] 0-management: failed to get statfs() call on brick /media/disk4/brick4 [No such file or directory]
>>>>>>>>>>
>>>>>>>>>> This means that the underlying brick /media/disk4/brick4 doesn't
>>>>>>>>>> exist. You already mentioned that you had replaced the faulty
>>>>>>>>>> disk, but have you not mounted it yet?
>>>>>>>>>>
>>>>>>>>>>> [2019-01-15 20:17:00.691240] I [MSGID: 106498] [glusterd-handler.c:3614:glusterd_friend_add_from_peerinfo] 0-management: connect returned 0
>>>>>>>>>>> [2019-01-15 20:17:00.691307] W [MSGID: 106061] [glusterd-handler.c:3408:glusterd_transport_inet_options_build] 0-glusterd: Failed to get tcp-user-timeout
>>>>>>>>>>> [2019-01-15 20:17:00.691331] I [rpc-clnt.c:1059:rpc_clnt_connection_init] 0-management: setting frame-timeout to 600
>>>>>>>>>>> [2019-01-15 20:17:00.692547] E [MSGID: 106187] [glusterd-store.c:4662:glusterd_resolve_all_bricks] 0-glusterd: resolve brick failed in restore
>>>>>>>>>>> [2019-01-15 20:17:00.692582] E [MSGID: 101019] [xlator.c:720:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
>>>>>>>>>>> [2019-01-15 20:17:00.692597] E [MSGID: 101066] [graph.c:367:glusterfs_graph_init] 0-management: initializing translator failed
>>>>>>>>>>> [2019-01-15 20:17:00.692607] E [MSGID: 101176] [graph.c:738:glusterfs_graph_activate] 0-graph: init failed
>>>>>>>>>>> [2019-01-15 20:17:00.693004] W [glusterfsd.c:1514:cleanup_and_exit] (-->/usr/local/sbin/glusterd(glusterfs_volumes_init+0xc2) [0x409f52] -->/usr/local/sbin/glusterd(glusterfs_process_volfp+0x151) [0x409e41] -->/usr/local/sbin/glusterd(cleanup_and_exit+0x5f) [0x40942f] ) 0-: received signum (-1), shutting down
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jan 16, 2019 at 4:34 PM Atin Mukherjee <amukh...@redhat.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> This is a case of a partial write of a transaction: as the host
>>>>>>>>>>>> ran out of space on the root partition, where all the glusterd
>>>>>>>>>>>> related configuration is persisted, the transaction couldn't be
>>>>>>>>>>>> written, and hence the new (replaced) brick's information wasn't
>>>>>>>>>>>> persisted in the configuration. The workaround for this is to
>>>>>>>>>>>> copy the content of /var/lib/glusterd/vols/gfs-tst/ from one of
>>>>>>>>>>>> the nodes in the trusted storage pool to the node where the
>>>>>>>>>>>> glusterd service fails to come up; after that, restarting the
>>>>>>>>>>>> glusterd service should make peer status report all nodes
>>>>>>>>>>>> healthy and connected.
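For reference, the workaround Atin describes just above comes down to something like this sketch; the hostnames are hypothetical, and it assumes rsync and root SSH access between the nodes, plus the default /var/lib/glusterd layout:

  # on the failed node (node3): replace the damaged volume configuration
  # with a copy from any healthy member of the trusted storage pool
  rsync -a --delete root@node2:/var/lib/glusterd/vols/gfs-tst/ /var/lib/glusterd/vols/gfs-tst/

  # restart glusterd so it restores the volume from the repaired store
  service glusterd restart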
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jan 16, 2019 at 3:49 PM Amudhan P <amudha...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> In short, when I started the glusterd service I got the
>>>>>>>>>>>>> following error msg in the glusterd.log file on one server.
>>>>>>>>>>>>> What needs to be done?
>>>>>>>>>>>>>
>>>>>>>>>>>>> [...]
>>>>>>>>>>>>>
>>>>>>>>>>>>> regards
>>>>>>>>>>>>> Amudhan
_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users