Hi,

It may be that the LIO service starts before /mnt gets mounted. With no
backing file present, LIO created a new one on the root filesystem (in
the /mnt directory). The gluster volume was then mounted on top of it,
but since LIO kept the original file open, it continued using that file
instead of the correct one on the gluster volume. So when you shut down
the first node, the active path for the iSCSI disk fails over to the
second node, which serves the empty file sitting on its root filesystem.
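If that is what happened, it can be checked and prevented. A rough
sketch, assuming systemd, that /mnt is mounted via fstab (so systemd
generates mnt.mount), that the backing file is /mnt/block.img, and that
your LIO config is restored by target.service (names vary by
distribution):

# Compare device/inode of the backing file with the gluster volume
# mounted, then again after unmounting it. A different device number
# confirms a second copy hiding underneath on the root filesystem.
stat -c 'dev=%d ino=%i size=%s' /mnt/block.img
umount /mnt
stat -c 'dev=%d ino=%i size=%s' /mnt/block.img

# To prevent it, order LIO after the mount with a systemd drop-in,
# e.g. /etc/systemd/system/target.service.d/after-mnt.conf containing:
#   [Unit]
#   RequiresMountsFor=/mnt
# then reload unit files:
systemctl daemon-reload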
> On 18 Nov 2016, at 19:21, Olivier Lambert <[email protected]> wrote:
>
> After Node 1 is DOWN, LIO on Node 2 (the iSCSI target) is no longer
> writing to the local Gluster mount, but to the root partition.
>
> Despite "df -h" showing the Gluster brick mounted:
>
> /dev/mapper/centos-root 3,1G 3,1G  20K 100% /
> ...
> /dev/xvdb                61G  61G 956M  99% /bricks/brick1
> localhost:/gv0           61G  61G 956M  99% /mnt
>
> If I unmount it, I still see the "block.img" in /mnt, which is filling
> up the root space. So it's like FUSE is messing with the local Gluster
> mount, which could lead to data corruption at the client level.
>
> It doesn't make sense to me... What am I missing?
>
> On Fri, Nov 18, 2016 at 5:00 PM, Olivier Lambert
> <[email protected]> wrote:
>> Yes, I only did it once heal info gave the previous result ("Number
>> of entries: 0"). But same result: as soon as the second node goes
>> offline (after both were working/back online), everything is
>> corrupted.
>>
>> To recap:
>>
>> * Node 1 UP, Node 2 UP -> OK
>> * Node 1 UP, Node 2 DOWN -> OK (just a small lag for multipath to see
>>   the path down and switch if necessary)
>> * Node 1 UP, Node 2 UP -> OK (after waiting until the heal command
>>   displays no entries)
>> * Node 1 DOWN, Node 2 UP -> NOT OK (data corruption)
>>
>> On Fri, Nov 18, 2016 at 3:39 PM, David Gossage
>> <[email protected]> wrote:
>>> On Fri, Nov 18, 2016 at 3:49 AM, Olivier Lambert <[email protected]>
>>> wrote:
>>>>
>>>> Hi David,
>>>>
>>>> What are the exact commands to be sure it's fine?
>>>>
>>>> Right now I get:
>>>>
>>>> # gluster volume heal gv0 info
>>>> Brick 10.0.0.1:/bricks/brick1/gv0
>>>> Status: Connected
>>>> Number of entries: 0
>>>>
>>>> Brick 10.0.0.2:/bricks/brick1/gv0
>>>> Status: Connected
>>>> Number of entries: 0
>>>>
>>>> Brick 10.0.0.3:/bricks/brick1/gv0
>>>> Status: Connected
>>>> Number of entries: 0
>>>>
>>> Did you run this before taking down the 2nd node, to see if any heals
>>> were ongoing?
>>>
>>> Also, I see you have sharding enabled. Are your files already being
>>> served sharded as well?
>>>
>>>> Everything is online and working, but this command gives a strange
>>>> output:
>>>>
>>>> # gluster volume heal gv0 info heal-failed
>>>> Gathering list of heal failed entries on volume gv0 has been
>>>> unsuccessful on bricks that are down. Please check if all brick
>>>> processes are running.
>>>>
>>>> Is that normal?
>>>
>>> I don't think that is a valid command anymore; when I run it I get the
>>> same message, and this shows up in the logs:
>>>
>>> [2016-11-18 14:35:02.260503] I [MSGID: 106533]
>>> [glusterd-volume-ops.c:878:__glusterd_handle_cli_heal_volume] 0-management:
>>> Received heal vol req for volume GLUSTER1
>>> [2016-11-18 14:35:02.263341] W [MSGID: 106530]
>>> [glusterd-volume-ops.c:1882:glusterd_handle_heal_cmd] 0-management: Command
>>> not supported. Please use "gluster volume heal GLUSTER1 info" and logs to
>>> find the heal information.
>>> [2016-11-18 14:35:02.263365] E [MSGID: 106301]
>>> [glusterd-syncop.c:1297:gd_stage_op_phase] 0-management: Staging of
>>> operation 'Volume Heal' failed on localhost : Command not supported. Please
>>> use "gluster volume heal GLUSTER1 info" and logs to find the heal
>>> information.
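On the heal question above: rather than waiting a fixed few minutes, you
can poll heal info until every brick reports zero entries before taking
the other node down. A small sketch using the volume name from this
thread:

# Block until "gluster volume heal gv0 info" shows zero pending
# entries on every brick.
while gluster volume heal gv0 info | grep 'Number of entries:' | grep -qv ': 0$'; do
    echo 'heals still pending...'
    sleep 10
done
echo 'all bricks report 0 entries'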
>>>
>>>> On Fri, Nov 18, 2016 at 2:51 AM, David Gossage
>>>> <[email protected]> wrote:
>>>>>
>>>>> On Thu, Nov 17, 2016 at 6:42 PM, Olivier Lambert
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> Okay, I used the exact same config you provided, adding an arbiter
>>>>>> node (node3).
>>>>>>
>>>>>> After halting node2, the VM continues to work after a small
>>>>>> "lag"/freeze. I restarted node2 and it came back online: OK.
>>>>>>
>>>>>> Then, after waiting a few minutes, I halted node1. And *just* at
>>>>>> this moment, the VM was corrupted (segmentation faults, /var/log
>>>>>> folder empty, etc.)
>>>>>>
>>>>> Other than waiting a few minutes, did you make sure heals had
>>>>> completed?
>>>>>
>>>>>> dmesg of the VM:
>>>>>>
>>>>>> [ 1645.852905] EXT4-fs error (device xvda1):
>>>>>> htree_dirblock_to_tree:988: inode #19: block 8286: comm bash: bad
>>>>>> entry in directory: rec_len is smaller than minimal - offset=0(0),
>>>>>> inode=0, rec_len=0, name_len=0
>>>>>> [ 1645.854509] Aborting journal on device xvda1-8.
>>>>>> [ 1645.855524] EXT4-fs (xvda1): Remounting filesystem read-only
>>>>>>
>>>>>> And then I got a lot of "comm bash: bad entry in directory"
>>>>>> messages...
>>>>>>
>>>>>> Here is the current config with all nodes back online:
>>>>>>
>>>>>> # gluster volume info
>>>>>>
>>>>>> Volume Name: gv0
>>>>>> Type: Replicate
>>>>>> Volume ID: 5f15c919-57e3-4648-b20a-395d9fe3d7d6
>>>>>> Status: Started
>>>>>> Snapshot Count: 0
>>>>>> Number of Bricks: 1 x (2 + 1) = 3
>>>>>> Transport-type: tcp
>>>>>> Bricks:
>>>>>> Brick1: 10.0.0.1:/bricks/brick1/gv0
>>>>>> Brick2: 10.0.0.2:/bricks/brick1/gv0
>>>>>> Brick3: 10.0.0.3:/bricks/brick1/gv0 (arbiter)
>>>>>> Options Reconfigured:
>>>>>> nfs.disable: on
>>>>>> performance.readdir-ahead: on
>>>>>> transport.address-family: inet
>>>>>> features.shard: on
>>>>>> features.shard-block-size: 16MB
>>>>>> network.remote-dio: enable
>>>>>> cluster.eager-lock: enable
>>>>>> performance.io-cache: off
>>>>>> performance.read-ahead: off
>>>>>> performance.quick-read: off
>>>>>> performance.stat-prefetch: on
>>>>>> performance.strict-write-ordering: off
>>>>>> cluster.server-quorum-type: server
>>>>>> cluster.quorum-type: auto
>>>>>> cluster.data-self-heal: on
>>>>>>
>>>>>> # gluster volume status
>>>>>> Status of volume: gv0
>>>>>> Gluster process                        TCP Port  RDMA Port  Online  Pid
>>>>>> ------------------------------------------------------------------------------
>>>>>> Brick 10.0.0.1:/bricks/brick1/gv0      49152     0          Y       1331
>>>>>> Brick 10.0.0.2:/bricks/brick1/gv0      49152     0          Y       2274
>>>>>> Brick 10.0.0.3:/bricks/brick1/gv0      49152     0          Y       2355
>>>>>> Self-heal Daemon on localhost          N/A       N/A        Y       2300
>>>>>> Self-heal Daemon on 10.0.0.3           N/A       N/A        Y       10530
>>>>>> Self-heal Daemon on 10.0.0.2           N/A       N/A        Y       2425
>>>>>>
>>>>>> Task Status of Volume gv0
>>>>>> ------------------------------------------------------------------------------
>>>>>> There are no active volume tasks
>>>>>>
>>>>>> On Thu, Nov 17, 2016 at 11:35 PM, Olivier Lambert
>>>>>> <[email protected]> wrote:
>>>>>>> It's planned to have an arbiter soon :) Those were just
>>>>>>> preliminary tests.
>>>>>>>
>>>>>>> Thanks for the settings, I'll test them soon and come back to
>>>>>>> you!
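For reference, a replica 2 + arbiter volume like the one above is
created in one command; a sketch with the brick paths from this thread
(needs a gluster release with arbiter support, 3.7 or later):

# The third brick holds only file metadata and acts as a tie-breaker,
# so it needs very little space.
gluster volume create gv0 replica 3 arbiter 1 \
    10.0.0.1:/bricks/brick1/gv0 \
    10.0.0.2:/bricks/brick1/gv0 \
    10.0.0.3:/bricks/brick1/gv0
gluster volume start gv0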
>>>>>>>
>>>>>>> On Thu, Nov 17, 2016 at 11:29 PM, Lindsay Mathieson
>>>>>>> <[email protected]> wrote:
>>>>>>>> On 18/11/2016 8:17 AM, Olivier Lambert wrote:
>>>>>>>>>
>>>>>>>>> gluster volume info gv0
>>>>>>>>>
>>>>>>>>> Volume Name: gv0
>>>>>>>>> Type: Replicate
>>>>>>>>> Volume ID: 2f8658ed-0d9d-4a6f-a00b-96e9d3470b53
>>>>>>>>> Status: Started
>>>>>>>>> Snapshot Count: 0
>>>>>>>>> Number of Bricks: 1 x 2 = 2
>>>>>>>>> Transport-type: tcp
>>>>>>>>> Bricks:
>>>>>>>>> Brick1: 10.0.0.1:/bricks/brick1/gv0
>>>>>>>>> Brick2: 10.0.0.2:/bricks/brick1/gv0
>>>>>>>>> Options Reconfigured:
>>>>>>>>> nfs.disable: on
>>>>>>>>> performance.readdir-ahead: on
>>>>>>>>> transport.address-family: inet
>>>>>>>>> features.shard: on
>>>>>>>>> features.shard-block-size: 16MB
>>>>>>>>
>>>>>>>> When hosting VMs, it's essential to set these options:
>>>>>>>>
>>>>>>>> network.remote-dio: enable
>>>>>>>> cluster.eager-lock: enable
>>>>>>>> performance.io-cache: off
>>>>>>>> performance.read-ahead: off
>>>>>>>> performance.quick-read: off
>>>>>>>> performance.stat-prefetch: on
>>>>>>>> performance.strict-write-ordering: off
>>>>>>>> cluster.server-quorum-type: server
>>>>>>>> cluster.quorum-type: auto
>>>>>>>> cluster.data-self-heal: on
>>>>>>>>
>>>>>>>> Also, with replica two and quorum on (required), your volume will
>>>>>>>> become read-only when one node goes down, to prevent the
>>>>>>>> possibility of split-brain - you *really* want to avoid that :)
>>>>>>>>
>>>>>>>> I'd recommend a replica 3 volume; that way one node can go down,
>>>>>>>> but the other two still form a quorum and remain r/w.
>>>>>>>>
>>>>>>>> If the extra disks are not possible, then an arbiter volume can
>>>>>>>> be set up - basically dummy files on the third node.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Lindsay Mathieson
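For completeness, the options Lindsay lists above can be applied to an
existing volume with "gluster volume set", one option at a time:

gluster volume set gv0 network.remote-dio enable
gluster volume set gv0 cluster.eager-lock enable
gluster volume set gv0 performance.io-cache off
gluster volume set gv0 performance.read-ahead off
gluster volume set gv0 performance.quick-read off
gluster volume set gv0 performance.stat-prefetch on
gluster volume set gv0 performance.strict-write-ordering off
gluster volume set gv0 cluster.server-quorum-type server
gluster volume set gv0 cluster.quorum-type auto
gluster volume set gv0 cluster.data-self-heal on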
--
Dmitry Glushenok
Jet Infosystems
+7-910-453-2568
