Hi,

So it looks like Satheesaran managed to recreate this issue. We will be seeking his help in debugging it; it will be easier that way.

-Krutika
On Tue, Mar 21, 2017 at 1:35 PM, Mahdi Adnan <[email protected]> wrote:

> Hello, and thank you for your email.
> Actually no, I didn't check the GFIDs of the VMs.
> If it will help, I can set up a new test cluster and get all the data you need.
>
> From: Nithya Balachandran
> Sent: Monday, March 20, 20:57
> Subject: Re: [Gluster-users] Gluster 3.8.10 rebalance VMs corruption
> To: Krutika Dhananjay
> Cc: Mahdi Adnan, Gowdappa, Raghavendra, Susant Palai, [email protected] List
>
> Hi,
>
> Do you know the GFIDs of the VM images which were corrupted?
>
> Regards,
> Nithya
>
> On 20 March 2017 at 20:37, Krutika Dhananjay <[email protected]> wrote:
>
> I looked at the logs.
>
> From the time the new graph (since the add-brick command you shared, where bricks 41 through 44 are added) is switched to (line 3011 onwards in nfs-gfapi.log), I see the following kinds of errors:
>
> 1. Lookups to a bunch of files failed with ENOENT on both replicas, which protocol/client converts to ESTALE. I am guessing these entries got migrated to other subvolumes, leading to 'No such file or directory' errors.
>
> DHT and thereafter shard get the same error code and log the following:
>
> [2017-03-17 14:04:26.353444] E [MSGID: 109040] [dht-helper.c:1198:dht_migration_complete_check_task] 17-vmware2-dht: <gfid:a68ce411-e381-46a3-93cd-d2af6a7c3532>: failed to lookup the file on vmware2-dht [Stale file handle]
>
> [2017-03-17 14:04:26.353528] E [MSGID: 133014] [shard.c:1253:shard_common_stat_cbk] 17-vmware2-shard: stat failed: a68ce411-e381-46a3-93cd-d2af6a7c3532 [Stale file handle]
>
> which is fine.
>
> 2. The other kind are from AFR logging of possible split-brain, which I suppose are harmless too:
>
> [2017-03-17 14:23:36.968883] W [MSGID: 108008] [afr-read-txn.c:228:afr_read_txn] 17-vmware2-replicate-13: Unreadable subvolume -1 found with event generation 2 for gfid 74d49288-8452-40d4-893e-ff4672557ff9. (Possible split-brain)
>
> Since you are saying the bug is hit only on VMs that are undergoing I/O while rebalance is running (as opposed to those that remained powered off), rebalance + I/O could be causing some issues.
>
> CC'ing DHT devs.
>
> Raghavendra/Nithya/Susant, could you take a look?
>
> -Krutika
>
> On Sun, Mar 19, 2017 at 4:55 PM, Mahdi Adnan <[email protected]> wrote:
>
> Thank you for your email, mate.
>
> Yes, I'm aware of this, but to save costs I chose replica 2; this cluster is all flash.
>
> In version 3.7.x I had issues with ping timeout: if one host went down for a few seconds the whole cluster hung and became unavailable, so to avoid this I adjusted the ping timeout to 5 seconds.
>
> As for choosing Ganesha over gfapi, VMware does not support Gluster (FUSE or gfapi), so I'm stuck with NFS for this volume.
>
> The other volume is mounted using gfapi in an oVirt cluster.
>
> --
> Respectfully
> Mahdi A. Mahdi
>
> From: Krutika Dhananjay <[email protected]>
> Sent: Sunday, March 19, 2017 2:01:49 PM
> To: Mahdi Adnan
> Cc: [email protected]
> Subject: Re: [Gluster-users] Gluster 3.8.10 rebalance VMs corruption
>
> While I'm still going through the logs, I just wanted to point out a couple of things:
>
> 1. It is recommended that you use 3-way replication (replica count 3) for the VM store use case.
>
> 2. network.ping-timeout at 5 seconds is way too low. Please change it to 30.
>
> Is there any specific reason for using NFS-Ganesha over gfapi/FUSE?
>
> Will get back with anything else I might find, or more questions if I have any.
>
> -Krutika
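For reference, the ping-timeout change suggested above is a single volume-set operation and can be applied online; a minimal sketch, assuming the volume name vmware2 from the volume info further down in the thread:

# check the current value first
gluster volume get vmware2 network.ping-timeout

# raise the ping timeout from 5 seconds to the recommended 30
gluster volume set vmware2 network.ping-timeout 30

Moving from replica 2 to replica 3 is a bigger change: it needs one extra brick per replica set, added with "gluster volume add-brick vmware2 replica 3 <new bricks>", after which self-heal populates the new bricks.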
> On Sun, Mar 19, 2017 at 2:36 PM, Mahdi Adnan <[email protected]> wrote:
>
> Thanks, mate.
>
> Kindly check the attachment.
>
> --
> Respectfully
> Mahdi A. Mahdi
>
> From: Krutika Dhananjay <[email protected]>
> Sent: Sunday, March 19, 2017 10:00:22 AM
> To: Mahdi Adnan
> Cc: [email protected]
> Subject: Re: [Gluster-users] Gluster 3.8.10 rebalance VMs corruption
>
> In that case could you share the ganesha-gfapi logs?
>
> -Krutika
>
> On Sun, Mar 19, 2017 at 12:13 PM, Mahdi Adnan <[email protected]> wrote:
>
> I have two volumes: one is mounted using libgfapi for the oVirt mount, the other is exported via NFS-Ganesha for VMware, which is the one I'm testing now.
>
> --
> Respectfully
> Mahdi A. Mahdi
>
> From: Krutika Dhananjay <[email protected]>
> Sent: Sunday, March 19, 2017 8:02:19 AM
> To: Mahdi Adnan
> Cc: [email protected]
> Subject: Re: [Gluster-users] Gluster 3.8.10 rebalance VMs corruption
>
> On Sat, Mar 18, 2017 at 10:36 PM, Mahdi Adnan <[email protected]> wrote:
>
> Kindly check the attached new log file. I don't know if it's helpful or not, but I couldn't find the log with the name you just described.
>
> No. Are you using FUSE or libgfapi for accessing the volume? Or is it NFS?
>
> -Krutika
>
> --
> Respectfully
> Mahdi A. Mahdi
>
> From: Krutika Dhananjay <[email protected]>
> Sent: Saturday, March 18, 2017 6:10:40 PM
> To: Mahdi Adnan
> Cc: [email protected]
> Subject: Re: [Gluster-users] Gluster 3.8.10 rebalance VMs corruption
>
> mnt-disk11-vmware2.log seems like a brick log. Could you attach the fuse mount logs? They should be right under the /var/log/glusterfs/ directory, named after the mount point name, only hyphenated.
>
> -Krutika
>
> On Sat, Mar 18, 2017 at 7:27 PM, Mahdi Adnan <[email protected]> wrote:
>
> Hello Krutika,
>
> Kindly check the attached logs.
>
> --
> Respectfully
> Mahdi A. Mahdi
>
> From: Krutika Dhananjay <[email protected]>
> Sent: Saturday, March 18, 2017 3:29:03 PM
> To: Mahdi Adnan
> Cc: [email protected]
> Subject: Re: [Gluster-users] Gluster 3.8.10 rebalance VMs corruption
>
> Hi Mahdi,
>
> Could you attach mount, brick and rebalance logs?
>
> -Krutika
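For anyone gathering the same data, these logs live in predictable places on the Gluster nodes and clients; a rough sketch of where to look, assuming the default /var/log/glusterfs location and the volume name vmware2 from this thread (the /mnt/vmstore mount point below is just a hypothetical example):

# FUSE mount log: named after the mount point, with '/' replaced by '-'
# (a client mount at /mnt/vmstore would log to /var/log/glusterfs/mnt-vmstore.log)
ls /var/log/glusterfs/*.log

# brick logs: one per brick, named after the brick path, under the bricks/ subdirectory
ls /var/log/glusterfs/bricks/

# rebalance log: one per volume, on each node that participated in the rebalance
ls /var/log/glusterfs/vmware2-rebalance.log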
> On Sat, Mar 18, 2017 at 12:14 AM, Mahdi Adnan <[email protected]> wrote:
>
> Hi,
>
> I have upgraded to Gluster 3.8.10 today and ran the add-brick procedure on a volume containing a few VMs.
>
> After the rebalance completed, I rebooted the VMs; some of them ran just fine, and others just crashed. Windows boots to recovery mode and Linux throws XFS errors and does not boot.
>
> I ran the test again and it happened just as the first time, but I have noticed that only VMs doing disk I/O are affected by this bug. The VMs that were powered off started fine, and even the md5 of the disk file did not change after the rebalance.
>
> Can anyone else confirm this?
>
> Volume info:
>
> Volume Name: vmware2
> Type: Distributed-Replicate
> Volume ID: 02328d46-a285-4533-aa3a-fb9bfeb688bf
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 22 x 2 = 44
> Transport-type: tcp
> Bricks:
> Brick1: gluster01:/mnt/disk1/vmware2
> Brick2: gluster03:/mnt/disk1/vmware2
> Brick3: gluster02:/mnt/disk1/vmware2
> Brick4: gluster04:/mnt/disk1/vmware2
> Brick5: gluster01:/mnt/disk2/vmware2
> Brick6: gluster03:/mnt/disk2/vmware2
> Brick7: gluster02:/mnt/disk2/vmware2
> Brick8: gluster04:/mnt/disk2/vmware2
> Brick9: gluster01:/mnt/disk3/vmware2
> Brick10: gluster03:/mnt/disk3/vmware2
> Brick11: gluster02:/mnt/disk3/vmware2
> Brick12: gluster04:/mnt/disk3/vmware2
> Brick13: gluster01:/mnt/disk4/vmware2
> Brick14: gluster03:/mnt/disk4/vmware2
> Brick15: gluster02:/mnt/disk4/vmware2
> Brick16: gluster04:/mnt/disk4/vmware2
> Brick17: gluster01:/mnt/disk5/vmware2
> Brick18: gluster03:/mnt/disk5/vmware2
> Brick19: gluster02:/mnt/disk5/vmware2
> Brick20: gluster04:/mnt/disk5/vmware2
> Brick21: gluster01:/mnt/disk6/vmware2
> Brick22: gluster03:/mnt/disk6/vmware2
> Brick23: gluster02:/mnt/disk6/vmware2
> Brick24: gluster04:/mnt/disk6/vmware2
> Brick25: gluster01:/mnt/disk7/vmware2
> Brick26: gluster03:/mnt/disk7/vmware2
> Brick27: gluster02:/mnt/disk7/vmware2
> Brick28: gluster04:/mnt/disk7/vmware2
> Brick29: gluster01:/mnt/disk8/vmware2
> Brick30: gluster03:/mnt/disk8/vmware2
> Brick31: gluster02:/mnt/disk8/vmware2
> Brick32: gluster04:/mnt/disk8/vmware2
> Brick33: gluster01:/mnt/disk9/vmware2
> Brick34: gluster03:/mnt/disk9/vmware2
> Brick35: gluster02:/mnt/disk9/vmware2
> Brick36: gluster04:/mnt/disk9/vmware2
> Brick37: gluster01:/mnt/disk10/vmware2
> Brick38: gluster03:/mnt/disk10/vmware2
> Brick39: gluster02:/mnt/disk10/vmware2
> Brick40: gluster04:/mnt/disk10/vmware2
> Brick41: gluster01:/mnt/disk11/vmware2
> Brick42: gluster03:/mnt/disk11/vmware2
> Brick43: gluster02:/mnt/disk11/vmware2
> Brick44: gluster04:/mnt/disk11/vmware2
> Options Reconfigured:
> cluster.server-quorum-type: server
> nfs.disable: on
> performance.readdir-ahead: on
> transport.address-family: inet
> performance.quick-read: off
> performance.read-ahead: off
> performance.io-cache: off
> performance.stat-prefetch: off
> cluster.eager-lock: enable
> network.remote-dio: enable
> features.shard: on
> cluster.data-self-heal-algorithm: full
> features.cache-invalidation: on
> ganesha.enable: on
> features.shard-block-size: 256MB
> client.event-threads: 2
> server.event-threads: 2
> cluster.favorite-child-policy: size
> storage.build-pgfid: off
> network.ping-timeout: 5
> cluster.enable-shared-storage: enable
> nfs-ganesha: enable
> cluster.server-quorum-ratio: 51%
>
> Adding bricks:
>
> gluster volume add-brick vmware2 replica 2 gluster01:/mnt/disk11/vmware2 gluster03:/mnt/disk11/vmware2 gluster02:/mnt/disk11/vmware2 gluster04:/mnt/disk11/vmware2
>
> Starting fix-layout:
>
> gluster volume rebalance vmware2 fix-layout start
>
> Starting rebalance:
>
> gluster volume rebalance vmware2 start
>
> --
> Respectfully
> Mahdi A. Mahdi
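On the GFID question earlier in the thread: the GFIDs that show up in the client log (for example a68ce411-e381-46a3-93cd-d2af6a7c3532 above) can be mapped back to file paths directly on a brick, because each brick keeps a hard link to every regular file under its .glusterfs directory, indexed by GFID. A rough sketch, run on a brick node, using the first brick path from the volume info above:

# each regular file has a hard link at <brick>/.glusterfs/<first two hex chars>/<next two>/<full gfid>
BRICK=/mnt/disk1/vmware2
GFID=a68ce411-e381-46a3-93cd-d2af6a7c3532

# resolve the GFID to its real path on this brick
find "$BRICK" -samefile "$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID" -not -path "*/.glusterfs/*"

# reverse direction: read the GFID of a known file from its xattrs on the brick
# (the image path below is a placeholder; replace it with an actual file under the brick)
getfattr -n trusted.gfid -e hex "$BRICK/images/example-vm.vmdk"

# with sharding enabled, shards beyond the first live under <brick>/.shard,
# named <gfid-of-the-base-file>.<shard-number>

If the GFID is not present on a given brick, the file simply hashed to a different replica pair; repeating the lookup on the other bricks will find it.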
_______________________________________________
Gluster-users mailing list
[email protected]
http://lists.gluster.org/mailman/listinfo/gluster-users
